Evolution of the Test The Text algorithm

Test The Text highlights stop words in a text. Stop words make a text harder to read, weaker, and longer.

Stop words are divided into several categories:
- modal verbs;
- intensifiers and generalizing modifiers and adverbs;
- cliches and bureaucratese;
- hypernyms;
- filler words about time;
- verbal nouns;
- passive voice;
- adverbs;
- participial phrases.

The prototype detected modal verbs from a list of every form of “to be able”, “must”, and “need”:

'modal': { 'can': u""", , , , , , , , , , , , , , , """, 'need': u', , , , ', 'should': u', , , ', 'other': u', , , , ' },

The text was split into words with the regular expression (?:[\s,.:]|\A|\Z), and each word was compared against the stop words. Matches were wrapped in <span class="stop-word class"> in the source text, and the marked-up text was sent back to replace the text on the page.
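The prototype's split-and-mark step can be sketched roughly as follows. This is a minimal illustration, not the original code: the English stop words, the `STOP_WORDS` table, and the `mark_stop_words` name are assumptions made for the example, and only the delimiter character class from the expression above is used, because the anchors are not needed for splitting.

```python
import re
from html import escape

# Delimiters from the article's expression: whitespace, comma, period, colon.
# The capturing group makes re.split keep the delimiters in the result.
DELIMITER = re.compile(r"([\s,.:])")

# Hypothetical stop-word table (word -> CSS class), English for readability.
STOP_WORDS = {"can": "modal", "must": "modal", "need": "modal"}


def mark_stop_words(text):
    """Wrap every token found in STOP_WORDS in a <span> with its class."""
    parts = []
    for token in DELIMITER.split(text):
        css_class = STOP_WORDS.get(token.lower())
        if css_class:
            parts.append('<span class="%s">%s</span>' % (css_class, escape(token)))
        else:
            # Delimiters and ordinary words pass through unchanged.
            parts.append(escape(token))
    return "".join(parts)


print(mark_stop_words("You must go, now."))
```

Because the delimiters are kept, the original spacing and punctuation survive the round trip.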

The prototype worked and the list of stop words grew, but when I reached the cliches I realized I could not keep listing every stop word in all its forms. Each stop word turned into 63 entries: three genders × three tenses × seven cases. So I plugged in PyStemmer.

A stemmer strips the ending and suffixes from a word, reducing it to a base form (the stem).

PyStemmer is built on the Snowball stemming algorithms.

Now the algorithm went through the stop words, discarding endings and suffixes, and then did the same for the words of the text. The resulting stems of the stop words and of the text words were compared. Stop phrases such as “dubious pleasure” are split into words, stemmed, and reassembled. To look for a stop phrase in the text's word list, the algorithm takes the word being checked together with the few words that follow it.
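A rough sketch of this stem-and-compare search is below. Everything named here is an assumption for illustration: `toy_stem` is a deliberately crude English stand-in and is not the Snowball algorithm, and `find_stop_phrases` is a hypothetical helper, not the project's function.

```python
import re


def toy_stem(word):
    """Stand-in stemmer (NOT Snowball): strips a few English endings."""
    word = word.lower()
    for ending in ("ing", "ed", "s"):
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def find_stop_phrases(text, stop_phrases, stem=toy_stem):
    """Return (word_index, phrase_length, phrase) for each stop phrase found."""
    stems = [stem(w) for w in re.findall(r"\w+", text)]
    # Split each stop phrase into words, stem them, and reassemble as a tuple.
    stemmed_phrases = [
        (tuple(stem(w) for w in phrase.split()), phrase)
        for phrase in stop_phrases
        if phrase.split()
    ]
    found = []
    for i in range(len(stems)):
        for phrase_stems, phrase in stemmed_phrases:
            n = len(phrase_stems)
            # Take the word being checked plus the n - 1 words that follow it.
            if tuple(stems[i:i + n]) == phrase_stems:
                found.append((i, n, phrase))
    return found


print(find_stop_phrases("Such dubious pleasures are rare", ["dubious pleasure"]))
```

With PyStemmer installed, a real stemmer could presumably be passed in place of the toy one, for example `stem=Stemmer.Stemmer('russian').stemWord`.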

Unfortunately, adverbs, passive voice, and participial phrases cannot be checked against dictionaries of stop words. Can you imagine a list of every adverb in the Russian language? It was time to bring in a morphological analyzer. For Python there is no alternative to pymorphy2 by kmike, so I settled on it.

A morphological analyzer determines a word's part of speech (noun, verb, adjective, ...), gender, number (singular or plural), case, person, tense, and, for verbs, voice. The full list is in the source code. There is also a fascinating article on how pymorphy2 works.

[Parse(word=u'', tag=OpencorporaTag('PRTS,perf,past,pssv masc,sing'), normal_form=u'', score=1.0, methods_stack=((<DictionaryAnalyzer>, u'', 745, 71),))]

[Parse(word=u'', tag=OpencorporaTag('ADJF plur,nomn'), normal_form=u'', score=0.212962962962963, methods_stack=((<DictionaryAnalyzer>, u'', 162, 20), (<UnknownPrefixAnalyzer>, u''))),
 Parse(word=u'', tag=OpencorporaTag('ADJF inan,plur,accs'), normal_form=u'', score=0.212962962962963, methods_stack=((<DictionaryAnalyzer>, u'', 162, 24), (<UnknownPrefixAnalyzer>, u''))),
 Parse(word=u'', tag=OpencorporaTag('PRTF,impf,tran,pres,actv plur,nomn'), normal_form=u'', score=0.212962962962963, methods_stack=((<DictionaryAnalyzer>, u'', 1609, 33), (<UnknownPrefixAnalyzer>, u''))),
 Parse(word=u'', tag=OpencorporaTag('PRTF,impf,tran,pres,actv inan,plur,accs'), normal_form=u'', score=0.212962962962963, methods_stack=((<DictionaryAnalyzer>, u'', 1609, 37), (<UnknownPrefixAnalyzer>, u''))),
 Parse(word=u'', tag=OpencorporaTag('PRTF,impf,intr,pres,actv plur,nomn'), normal_form=u'', score=0.03703703703703704, methods_stack=((<FakeDictionary>, u'', 1670, 33), (<KnownSuffixAnalyzer>, u''))),
 Parse(word=u'', tag=OpencorporaTag('PRTF,impf,intr,pres,actv inan,plur,accs'), normal_form=u'', score=0.03703703703703704, methods_stack=((<FakeDictionary>, u'', 1670, 37), (<KnownSuffixAnalyzer>, u''))),
 Parse(word=u'', tag=OpencorporaTag('PRTF,impf,tran,pres,actv plur,nomn'), normal_form=u'', score=0.03703703703703704, methods_stack=((<FakeDictionary>, u'', 2631, 33), (<KnownSuffixAnalyzer>, u''))),
 Parse(word=u'', tag=OpencorporaTag('PRTF,impf,tran,pres,actv inan,plur,accs'), normal_form=u'', score=0.03703703703703704, methods_stack=((<FakeDictionary>, u'', 2631, 37), (<KnownSuffixAnalyzer>, u'')))]

After the morphological analyzer was connected, Test The Text began highlighting adverbs, passive voice, participial phrases and, while I was at it, interjections.
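One way such tags could be mapped to stop-word categories is sketched below. It works on plain tag strings like those in the analyzer output above, so it runs without pymorphy2. The grammeme names (ADVB, INTJ, PRTF, PRTS, pssv) are OpenCorpora's; the category labels and the `categories_for_tag` helper are assumptions for this example, and the article does not say how the real rules are written.

```python
import re

# OpenCorpora grammeme -> category label (labels are made up for this sketch).
GRAMMEME_CATEGORIES = [
    ("pssv", "passive"),       # passive voice
    ("PRTF", "participle"),    # full participle
    ("PRTS", "participle"),    # short participle
    ("ADVB", "adverb"),
    ("INTJ", "interjection"),
]


def categories_for_tag(tag):
    """Return the categories triggered by one OpenCorpora tag string."""
    # Grammemes in a tag are separated by commas and a single space.
    grammemes = set(re.split(r"[ ,]", tag))
    result = []
    for grammeme, category in GRAMMEME_CATEGORIES:
        if grammeme in grammemes and category not in result:
            result.append(category)
    return result


print(categories_for_tag("PRTS,perf,past,pssv masc,sing"))
print(categories_for_tag("ADJF plur,nomn"))
```

With pymorphy2 the tag string would come from something like `str(pymorphy2.MorphAnalyzer().parse(word)[0].tag)`. The sketch looks at a single parse and ignores the alternative parses and their scores, and it flags individual participles only, which is less than detecting a whole participial phrase.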

What remained was to deal with problems on the client. I had no idea a simple text input field could cause so much trouble. I had to write JavaScript to clean up pasted text and to insert a <br> when Enter is pressed. In the new version I replaced my code with Wysihtml5, a lightweight HTML editor. I found it while studying how the editor in Basecamp was built.

In addition, I had to move the text markup to the client. Checking a text is not instantaneous, and the user could type a couple more sentences before the response arrived. Since the text was marked up on the server, all of those changes were wiped out.

Instead, the server began returning a list of stop words, each with its class and its start and end position, and the markup is now applied on the client. If the word at a given position does not match the one in the server's response, it is not marked. This covers the case where the user has changed the text and the word has shifted.
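The position check can be sketched as follows. The real client is written in JavaScript; Python is used here only to show the logic. The response shape (`word`, `cls`, `start`, `end` with an exclusive end) and the `apply_marks` name are assumptions for this example.

```python
from html import escape


def apply_marks(text, marks):
    """Wrap the stop words in <span>, skipping marks that no longer match.

    marks: list of dicts with 'word', 'cls', 'start', 'end' (end exclusive).
    """
    out = []
    cursor = 0
    for mark in sorted(marks, key=lambda m: m["start"]):
        start, end = mark["start"], mark["end"]
        # Skip overlapping marks and marks whose position no longer holds
        # the expected word, e.g. because the user edited the text meanwhile.
        if start < cursor or text[start:end] != mark["word"]:
            continue
        out.append(escape(text[cursor:start]))
        out.append('<span class="%s">%s</span>' % (mark["cls"], escape(text[start:end])))
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)


marks = [{"word": "must", "cls": "modal", "start": 4, "end": 8}]
print(apply_marks("You must go", marks))
print(apply_marks("Well, you must go", marks))  # text changed, mark is skipped
```

A stale mark is simply dropped, so the user's newer text is never overwritten; the next check would return fresh positions.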

Development plans:
- Highlight sentences longer than 17 words; they are hard to read.
- Highlight paragraphs longer than 8 lines. Such paragraphs most likely need to be split.
- Track the "rhythm" of the text. In a good text, long sentences alternate with medium and short ones, so the text is not monotonous and the reader does not fall asleep.
- Detect repeated words in adjacent sentences.
- Detect the subjunctive mood.
- Offer a public API.
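As an illustration of the first item, a long-sentence check might look like the sketch below. It is an assumption about one possible approach, not planned code: the `long_sentences` helper is hypothetical, and splitting on end punctuation followed by whitespace is a naive rule that will trip over abbreviations.

```python
import re

MAX_WORDS = 17  # threshold taken from the plan above


def long_sentences(text, max_words=MAX_WORDS):
    """Return the sentences that contain more than max_words words."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(re.findall(r"\w+", s)) > max_words]


sample = "This is short. " + " ".join(["word"] * 18) + "."
print(long_sentences(sample))
```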

Subscribe to our blog on Habr and to our useful newsletter. We write about the informational style and analyze other people's posts.

Source: https://habr.com/ru/post/204898/
