
A simple way to assess the clarity of a Russian text

In essence, what follows is my commentary on the post “What is ‘Clear Russian’ in terms of technology. Let's take a look at text readability metrics”. Since I cannot leave comments, I am writing in the Sandbox.

The clarity criteria considered in that post rely on almost no knowledge of the language in which the texts are written: it suffices to know how a text divides into words and sentences. This approach is easy to compute, but it leaves a lot of relevant data unused. It seems to me that in the case of Russian it is obvious which data could still be used, and that data is easily accessible.

In my opinion, it makes sense to divide incomprehensibility into two types:

(a) deep incomprehensibility (when it is impossible to make out what is written);
(b) incomprehensibility associated with complexity.

Incomprehensibility of type (a), which saturates every second official document, if not every one, comes from the fact that people simply do not know how to express their thoughts. Something that seems clear in one's head and can somehow be explained “in words” refuses to be transferred to paper: constructions are left unclosed, anaphoric references get tangled, coordination joins things that would be better left unjoined, and so on. In its pure form it is hard to distinguish automatically from normal text: even people who read the text superficially often think it is more or less fine, and only later does it turn out to be a morass. Moreover, it cannot be fixed automatically: first you have to sit down with the author and spend a long time finding out what he had in mind. Fortunately, this kind of incomprehensibility almost always brings incomprehensibility of type (b) along with it, so it is at least possible to identify incomprehensible texts.

Incomprehensibility as complexity means that people use non-trivial linguistic devices that are hard to follow without education and/or considerable effort. And here we run into the indirect nature of traditional metrics. Long sentences are, of course, best avoided, but a long sentence as such is not synonymous with obscurity: a simple enumeration can make a sentence long without making it incomprehensible. Nor does the use of long words make a text obviously incomprehensible. After all, nobody has abolished technical language, and it is impossible to convey every subtlety in simple words, not to mention that official documents cannot do without “implementation”, “bringing” and similar many-letter words. In other words, as long as you do not invent new terms all the time, people will gradually come to speak the same language.

It seems to me that complexity of type (b) is primarily syntactic, or rhetorical, complexity. Officialese is usually characterized by the fact that the parse tree of a phrase quickly breaks through the ceiling, and this is typical of almost any "dark" text. To make texts more understandable, we need to make them structurally simple. And that is very simple to do: in the overwhelming majority of cases, syntactic complexity is achieved through a single device, the active participle. Try to write a confusing text without active participles, and you will see that it is almost impossible. Either you end up with complete absurdity, or the sentences will necessarily be shorter, and more understandable. The thesis that Russians do not use participles and verbal adverbs in colloquial speech is as old as the world. It is not entirely correct (I know people who use participles and verbal adverbs in their speech, and I use them myself), but there is no doubt that they are above all a feature of the written language and the result of attempts to write in Russian like Cicero (or like whichever Greeks were copied by the people who launched the second South Slavic influence).

I do not claim that this is the only right way to assess the clarity of a text, but I am almost sure that the number of active participles will reveal a complex Russian text no worse than any other single-factor metric. As a test, I took five texts: “The Captain's Daughter”, “War and Peace”, the epilogue to “War and Peace” taken separately, famous for its comprehensibility, “Classical and non-classical ideals of rationality” by Merab Mamardashvili (a modern philosophical text by a Russian-speaking author) and the federal Law “On Education in the Russian Federation”. I divided the texts into sentences and, using Python 3 + pymorphy2, calculated the average number of active participles per sentence for each text. The result was predictable, but still eloquent:
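The author's script is not shown, so the following is only a minimal sketch of how such a per-sentence count could be computed. The function names, the naive regex sentence splitter and the tokenizer are all assumptions made for illustration. The pymorphy2 part relies on the OpenCorpora grammemes `PRTF` (full participle) and `actv` (active voice) and takes the first parse of each word, which is a simplification: ambiguous word forms may be misclassified.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., !, ? or … followed by whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return alphabetic tokens (Cyrillic included); digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", sentence)


def mean_active_participles(text, is_active_participle):
    """Average number of active participles per sentence.

    `is_active_participle` is any callable word -> bool, so the counting
    logic stays independent of the morphological analyzer.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total = sum(
        1
        for sentence in sentences
        for word in tokenize(sentence)
        if is_active_participle(word)
    )
    return total / len(sentences)


def make_pymorphy2_predicate():
    """Build a predicate backed by pymorphy2 (third-party, imported lazily)."""
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    def is_active_participle(word):
        # Simplification: trust the first (most probable) parse only.
        tag = morph.parse(word)[0].tag
        return "PRTF" in tag and "actv" in tag

    return is_active_participle
```

With pymorphy2 installed, a text would be scored as `mean_active_participles(text, make_pymorphy2_predicate())`. The splitter will mis-handle abbreviations such as «т. е.», so a real run would probably need a more careful sentence segmenter.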



The service proposed in the post gives the following results:



In two attempts, it failed to cope with the full text of “War and Peace”; it would be interesting to know what went wrong there. We see that the ordering is the same, but when measured by participles, the gap between the Law on Education and “The Captain's Daughter”, as well as between the epilogue to “War and Peace” and Mamardashvili's text, is larger. I do not vouch for the absolute values, but I suspect that Mamardashvili's text is indeed more complicated than Tolstoy's.

On the other hand, it turns out that Mamardashvili's text is the most complex of all. The complexity of words can be judged not only by their length, but also by how often they occur in texts. Rare word = difficult word. To measure the rarity of words, I took the frequency data published on the NCRF website, and for each text I built an array in which each word is mapped to a number = 1 / occurrence (i.e., the rarity of the word). In the NCRF table, the rarest words have an occurrence of 3, so a word that was not in the table received a rarity of 1/2. Then I calculated the average dictionary rarity for each text. In this ranking, “War and Peace” clearly surpassed its epilogue (there is no French there), and higher still were “The Captain's Daughter” (many non-trivial spellings), the Law on Education and, by a wide margin, “Ideals”. This is a slightly crooked result, but it shows how specific Mamardashvili's text is. If we multiply the participle data by the word data, we get the following rating, which in my opinion is very meaningful:
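The rarity score and the combined rating described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's code: the function names are invented, the frequency table is assumed to be a plain dict from lowercase word to occurrence count, and the source does not say whether words were reduced to their dictionary form before lookup, so here they are looked up as written.

```python
def word_rarity(word, frequency, unseen_rarity=0.5):
    """Rarity = 1 / occurrence; words missing from the table get 1/2,
    as in the text (the rarest listed words have an occurrence of 3)."""
    count = frequency.get(word.lower())
    if not count:
        return unseen_rarity
    return 1.0 / count


def mean_rarity(words, frequency, unseen_rarity=0.5):
    """Average rarity over all word tokens of a text."""
    if not words:
        return 0.0
    total = sum(word_rarity(w, frequency, unseen_rarity) for w in words)
    return total / len(words)


def combined_score(mean_participles, mean_word_rarity):
    """Product of the two single-factor metrics."""
    return mean_participles * mean_word_rarity


# Toy frequency table with invented counts, for illustration only.
toy_frequency = {"и": 100, "книга": 4}
toy_words = ["и", "книга", "зюзя"]  # the last word is absent from the table
toy_rarity = mean_rarity(toy_words, toy_frequency)
```

Because rarity is averaged over tokens, a handful of out-of-table words (each worth 1/2) can outweigh thousands of common ones, which may be part of why texts with unusual spellings or foreign passages score high.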

Source: https://habr.com/ru/post/239511/

