
Linguistic riddle. Translating from a "dead" language. [§2] Debriefing

This is a continuation of, or rather the answer to, the problem from the article "Linguistic riddle. Translating from a dead language."

I am catastrophically short of time, but as we know there is never enough of it, and since I promised, I have to write the article. Once again, I apologize for the delay.

Answer

For the impatient, here is the answer, which, by the way, at the time of writing had not been fully unraveled by anyone except a single person (not from Habr). But more on that below ...
The well-known phrase about the “Glokaya kuzdra” (hello, AndreyDmitriev), which is quoted in Uspensky's book “A Word about Words”, was insidiously placed by me in the middle of the text. The rest, as already mentioned, was composed around it in the same spirit, partly even in an “Old Russian” manner ...
Do not utter the murder of the Glock upon the booster of your side, for the prosperous will tremble, and the new ones shall shine. Glokaya kuzdra shteko budlanula bokra i kurdyachit bokryonka. Yes, there will not be a cozdra side by the side, and a cobra side by the kuzdra will be in front of my face, for I've frightened me terribly, and I'm so budlanuto.

There were many attempts by different people, some in the right direction and some less so. Your humble servant found it very interesting to follow the “guessing” process itself. Many thanks to all participants for that.
After my examples, several people discovered Tcl for themselves or decided to study it more seriously (special greetings to the "recruits"), which, as a developer, cannot but please me!
All in all, nothing but positive impressions on my part ...
I suspect that if I had offered a job to those who distinguished themselves, interest in the task would have been much higher, but unfortunately we had no vacancies at the time, so it is what it is.
A simple example of machine translation (TCL in three lines)

Here are very simple Tcl sources with dictionaries for converting back and forth. They only demonstrate that the “machine” translation works (in three lines); as examples they are not really suitable. To apply syllable and character substitutions that distinguish prefixes from suffixes, those dictionaries use spaces, periods and commas. Such a dictionary is rather cumbersome and inconvenient to parse, so the algorithms I give below are somewhat more complicated but use a dictionary that is easier to understand.
Tcl is usually available out of the box on Linux (tclsh); on Windows (after installing Tcl) it is better to use wish instead of tclsh because of the multiple encodings involved. If anyone wants the examples in Python, I will rewrite them and add them to the post.
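
To make the idea of a delimiter-based dictionary concrete, here is a minimal sketch. The dictionary `toyMap` and the proc `simple_translate` are hypothetical illustrations in Latin script, not the author's original sources: a leading space marks a prefix rule, and a trailing comma or period marks a suffix rule.

```tcl
# Hypothetical toy dictionary (not the author's): keys that start with a
# space only match at the beginning of a word, keys that end with a comma
# or period only match at the end of a word.
set toyMap {
    " re"  " "
    "ing," "ed,"
    "ing." "ed."
    "th"   "z"
}

# Pad the text with spaces so that the first word also has a boundary in
# front of it, apply all substitutions in one pass, then strip the padding.
proc simple_translate {map text} {
    string trim [string map $map " $text "]
}

puts [simple_translate $toyMap "reading, rethinking."]
# prints: aded, zinked.
```

Every suffix rule has to be repeated for each possible delimiter, which hints at why such a dictionary quickly becomes cumbersome.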

The debriefing itself


I am not a linguist in the strict sense of the word, but first, I speak and write fluently in several languages from different language groups (in some of them as well as in my native one). Second, my work involves parsing and analyzing texts, including multilingual ones (search engines, recognition, indexing, etc.). Third, again partly because of my work, I understand quite well how some languages evolved and changed over time. In addition, I am a programmer, i.e. transforming one thing into another, at least in a “machine” way, even in my head, comes naturally to me to a certain extent.

As an initial example, let's take some “transformations” that individual languages underwent a long time ago:
What I would like to clarify here: unlike a linguist, I cannot say what changed and in which order, but I do have a purely practical grasp of this morphology, the “math of the equation”, so to speak.

I deliberately did not consider grammatical changes to sentence syntax during translation, i.e. everything in the text was “translated” literally. With a full transformation, for example reordering the words, the chances of recovering the Russian text from such a short passage would simply tend to zero. Compare, for example, even such closely related languages of one group as Danish and German.

The literal translation (morphologically correct, but syntactically worthless):
[DK] du behøver ikke komme, hvis du har bedre ting for
[DE] Du brauchst nicht kommen, falls du hast bessere Dinge vor
Correct translation into German:
[DK] du behøver ikke komme, hvis du har bedre ting for
[DE] Du brauchst nicht zu kommen, falls du etwas Besseres vorhast

Therefore, I repeat, the translation is done literally, without changing the word order, i.e. the characteristic Russian syntax is preserved.

We define our language as slightly “growling” and “hooting”, and make some letters in certain positions “hard to hear” or unreadable. We slightly complicate the spelling of the language, so its words come out somewhat longer than in Russian. The result should resemble Russian with an admixture of Scandinavian (and possibly other Germanic) languages on the one hand, and something from the Turkic languages on the other.
So let's get started ...

Translating into the "dead" language

Originally I translated straight into the symbols of the Georgian alphabet, but the comments made it clear that not all “hieroglyphs” are equal (this is hard to follow), so here we do the translation in Cyrillic first and switch to the Georgian alphabet only at the end.

To start with, here is a small subroutine that lets you replace syllables and letters simply and elegantly while “distinguishing” the beginning and end of words.

The magic routine [Translate]
 # Apply a sequence of substitution operations to a text:
 proc magic_text {args} {
   set text [lindex $args end]
   foreach {op val} [lrange $args 0 end-1] {
     switch -- $op {
       -regexp { foreach {re sub} $val { regsub -nocase -all $re $text $sub text } }
       -map    { set text [string map -nocase $val $text] }
       default { error "unknown operation '$op'" }
     }
   }
   return $text
 }
 # Wrap every word in ^...$ markers, apply the dictionary,
 # then remove whatever markers are left over:
 proc Translate {dictMap encText} {
   magic_text -regexp {{(\m[^\s[:punct:]]+\M)} {^\1$}} -map $dictMap -map {^ "" $ ""} $encText
 }

A first test: using a simple dictionary, we will try to rewrite the phrase “Mom washed the frame” in the plural, with “Roma” instead of “frame”.
 #   (^ -  , $ - ): % Translate {^  $  $  $ "" ^ } "  ,    ."   ,    . 
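
Since the Cyrillic example does not survive in this copy, here is a small Latin-script sketch of the same mechanism. The dictionary `{^un "" ing$ ed th z}` is a hypothetical illustration, not the author's; the two procs are repeated so that the snippet runs on its own. A key starting with `^` applies only at the beginning of a word, a key ending with `$` only at the end, and a plain key applies anywhere.

```tcl
# Apply -regexp and -map operations, in order, to the last argument.
proc magic_text {args} {
    set text [lindex $args end]
    foreach {op val} [lrange $args 0 end-1] {
        switch -- $op {
            -regexp { foreach {re sub} $val { regsub -nocase -all $re $text $sub text } }
            -map    { set text [string map -nocase $val $text] }
            default { error "unknown operation '$op'" }
        }
    }
    return $text
}

# Mark word boundaries with ^ and $, apply the dictionary, drop the markers.
proc Translate {dictMap encText} {
    magic_text -regexp {{(\m[^\s[:punct:]]+\M)} {^\1$}} -map $dictMap -map {^ "" $ ""} $encText
}

# Hypothetical rules: drop the prefix "un", turn the suffix "ing" into "ed",
# and replace "th" everywhere with "z".
puts [Translate {^un "" ing$ ed th z} "Undoing things, unthinking."]
# prints: doed zings, zinked.
```

Note that the "ing" inside "things" is left alone, because there it is not followed by the end-of-word marker.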

Now we proceed to creating the dictionary for our "dead" language.

 #   set RuXy {                                                           $  $  $  $  $  $  $  $  $  } # 1   puts 1:[set ru "       ,   ,   .        .      ,       ,    ,    ."] # 2  : puts 2:[Translate $RuXy $ru] 

Result of execution (translation):
 1:       ,   ,   .        .      ,       ,    ,    . 2:       ,   ,   .        .      ,       ,    ,    . 

By the way, if you adopt a few reading conventions (for example, "p" and "x" at the end of a word are almost never pronounced), you can read the entire sentence almost as in Russian right away.
The goal has been achieved; now we simply map the result onto the Georgian alphabet, keeping vowels as vowels and consonants as consonants (so as not to make things too complicated).

Below is a script that uses the Georgian alphabet, and translates back and forth:
 #   set RuXz { แƒกแƒข  แƒกแƒชแƒšแƒ’  แƒกแƒชแƒข  แƒจแƒช  แƒ’แƒ˜แƒšแƒ   แƒคแƒแƒ’  แƒ แƒ”แƒแƒš  แƒžแƒแƒ’แƒ“แƒš  แƒžแƒ”แƒฃแƒš  แƒžแƒ”แƒแƒš  แƒ–แƒ“  แƒžแƒ’แƒšแƒแƒ”  แƒ–แƒ’แƒ  แƒแƒ’แƒ”แƒšแƒ”แƒ  แƒ™แƒ”แƒšแƒ”แƒ  แƒ™แƒ’แƒšแƒ”แƒ’  แƒ™แƒ”แƒแƒšแƒ”แƒ’  แƒ™แƒ”แƒฃแƒ’แƒš  แƒ™แƒ”แƒแƒš  แƒ’แƒขแƒแƒ’  แƒแƒ’แƒš  แƒ›แƒ”แƒฃแƒ’แƒš  แƒ”แƒ“แƒ’แƒ”แƒ’  แƒแƒ“แƒ’  แƒฃแƒšแƒ“แƒ’  แƒ‘แƒšแƒ’  แƒแƒšแƒ”แƒแƒ’  แƒ”แƒšแƒ”แƒแƒ’  แƒ”แƒฃแƒ’แƒš $ แƒ”แƒš $ แƒ”แƒแƒ’แƒš $ แƒแƒš $ แƒ˜แƒš $ แƒแƒ’ $ แƒฃแƒ’ $ แƒœแƒแƒ’ $ แƒฎแƒ’ $ แƒ›แƒ”แƒแƒ’  แƒ”แƒ  แƒจแƒฉ  แƒ”แƒ’  แƒ”แƒฃ  แƒ”แƒ  แƒ  แƒž  แƒ•  แƒฎ  แƒ“  แƒ˜  แƒ–  แƒŸ  แƒ”  แƒ”  แƒ™  แƒ   แƒœ  แƒ›  แƒ  แƒ‘  แƒš  แƒก  แƒข  แƒฃ  แƒค  แƒ’  แƒช  แƒฉ  แƒจ  _  แƒ”  แƒ˜} #   set XzRu {แƒกแƒข  แƒกแƒชแƒšแƒ’  แƒกแƒชแƒข  แƒจแƒช  แƒ’แƒ˜แƒšแƒ   แƒคแƒแƒ’  แƒ แƒ”แƒแƒš  แƒžแƒแƒ’แƒ“แƒš  แƒžแƒ”แƒฃแƒš  แƒžแƒ”แƒแƒš  แƒ–แƒ“  แƒžแƒ’แƒšแƒแƒ”  แƒ–แƒ’แƒ  แƒแƒ’แƒ”แƒšแƒ”แƒ  แƒ™แƒ”แƒšแƒ”แƒ  แƒ™แƒ’แƒšแƒ”แƒ’  แƒ™แƒ”แƒแƒšแƒ”แƒ’  แƒ™แƒ”แƒฃแƒ’แƒš  แƒ™แƒ”แƒแƒš  แƒ’แƒขแƒแƒ’  แƒแƒ’แƒš  แƒ›แƒ”แƒฃแƒ’แƒš  แƒ”แƒ“แƒ’แƒ”แƒ’  แƒแƒ“แƒ’  แƒฃแƒšแƒ“แƒ’  แƒ‘แƒšแƒ’  แƒแƒšแƒ”แƒแƒ’  แƒ”แƒšแƒ”แƒแƒ’  แƒ”แƒฃแƒ’แƒš  แƒ”แƒš$  แƒ”แƒแƒ’แƒš$  แƒแƒš$  แƒ˜แƒš$  แƒแƒ’$  แƒฃแƒ’$  แƒœแƒแƒ’$  แƒฎแƒ’$  แƒ›แƒ”แƒแƒ’$  แƒ”แƒ  แƒจแƒฉ  แƒ”แƒ’  แƒ”แƒฃ  แƒ”แƒ  แƒ  แƒž  แƒ•  แƒฎ  แƒ“  แƒ˜  แƒ–  แƒŸ  แƒ”  แƒ”  แƒ™  แƒ   แƒœ  แƒ›  แƒ  แƒ‘  แƒš  แƒก  แƒข  แƒฃ  แƒค  แƒ’  แƒช  แƒฉ  แƒจ  _  แƒ”  แƒ˜ } # 1   puts 1:[set ru "       ,   ,   .        .      ,       ,    ,    ."] # 2      $RuXz puts 2:[set xz [Translate $RuXz $ru]] # 3      $XzRu puts 3:[set ru2 [Translate $XzRu $xz]] 

Well, the result of the execution (translation):
 1:       ,   ,   .        .      ,       ,    ,    . 2:แƒ›แƒ”แƒแƒ’ แƒžแƒ’แƒšแƒแƒ”แƒ–แƒ’แƒแƒกแƒ”แƒš แƒžแƒแƒ’แƒ“แƒšแƒ แƒ”แƒแƒšแƒ›แƒ”แƒšแƒ”แƒแƒ’ แƒ’แƒ˜แƒšแƒ แƒแƒ’แƒšแƒแƒฎแƒแƒ’ แƒ›แƒแƒš แƒ™แƒ”แƒฃแƒ’แƒšแƒ–แƒ“แƒšแƒ˜แƒ›แƒ”แƒ˜แƒš แƒžแƒ”แƒแƒšแƒ™แƒ’แƒšแƒ”แƒ’ แƒกแƒคแƒแƒ’แƒ˜แƒฎแƒแƒ’, แƒ”แƒžแƒ”แƒแƒš แƒคแƒแƒ’แƒกแƒชแƒšแƒ’แƒ˜แƒ‘แƒ˜แƒจแƒฉแƒฃแƒšแƒ“แƒ’ แƒžแƒแƒ’แƒ“แƒšแƒ แƒ”แƒแƒšแƒ”แƒฃแƒ’แƒšแƒจแƒฉแƒ”แƒ˜แƒš, แƒ“แƒแƒš แƒคแƒแƒ’แƒจแƒชแƒ”แƒšแƒ”แƒแƒ’แƒ”แƒฃแƒ’แƒšแƒข แƒžแƒแƒ’แƒ“แƒšแƒ แƒ”แƒแƒšแƒ˜แƒœแƒ”แƒ’แƒ˜แƒš. แƒ’แƒ˜แƒšแƒ แƒแƒ’แƒ”แƒšแƒ”แƒ แƒ™แƒ”แƒฃแƒ’แƒšแƒ–แƒ“แƒšแƒแƒš แƒกแƒขแƒ˜แƒ™แƒ”แƒแƒš แƒžแƒแƒ’แƒ“แƒšแƒ แƒ”แƒแƒšแƒ›แƒ”แƒฃแƒ’แƒšแƒ แƒ”แƒแƒš แƒžแƒ”แƒแƒšแƒ™แƒ’แƒšแƒ”แƒ’ แƒ”แƒš แƒ™แƒ”แƒฃแƒ’แƒšแƒšแƒ“แƒ”แƒแƒฉแƒ”แƒ“แƒ’แƒ”แƒ’ แƒžแƒ”แƒแƒšแƒ™แƒšแƒ”แƒแƒ›แƒ™แƒ”แƒแƒšแƒ”แƒ’. แƒ“แƒแƒš แƒ›แƒ”แƒแƒ’ แƒžแƒแƒ’แƒ“แƒšแƒ˜แƒข แƒฃแƒ’ แƒžแƒ”แƒแƒšแƒ™แƒ’แƒšแƒ”แƒ’ แƒ™แƒ”แƒฃแƒ’แƒšแƒ–แƒ“แƒšแƒ”แƒ’, แƒแƒš แƒฃแƒ’ แƒ™แƒ”แƒฃแƒ’แƒšแƒ–แƒ“แƒšแƒ”แƒ’ แƒžแƒ”แƒแƒšแƒ™แƒ’แƒšแƒ”แƒ’ แƒ‘แƒšแƒ’แƒ˜แƒ“ แƒ แƒ”แƒชแƒ˜แƒœแƒแƒ’ แƒœแƒแƒ”แƒœแƒแƒ’, แƒ”แƒžแƒ”แƒแƒš แƒ”แƒแƒ’แƒš แƒกแƒชแƒšแƒ’แƒแƒจแƒ›แƒแƒ’ แƒžแƒแƒ’แƒ“แƒšแƒ แƒ”แƒแƒšแƒ›แƒ”แƒฃแƒ’แƒšแƒ , แƒ”แƒš แƒœแƒ›แƒแƒ”แƒฃแƒ’แƒš แƒขแƒแƒฎแƒ’ แƒžแƒแƒ’แƒ“แƒšแƒ แƒ”แƒแƒšแƒ›แƒ”แƒฃแƒ’แƒšแƒ’แƒขแƒแƒ’. 3:       ,   ,   .        .      ,       ,    ,    . 


Translation from the "dead" language

Before publishing the riddle article, I checked on myself whether a solution could be found. Of course, this is far from a clean test (after all, I know the result), but I approached the analysis of the morphology purely “scientifically”: I used some tools, started by translating a completely different passage with my dictionary (a piece of text taken from a random position, unknown to me, in another work), and worked in most cases only with the Georgian alphabet (which, by the way, is also foreign to me).

I must note that doing the translation either fully automatically or entirely by hand seemed almost impossible to me. As I see it, the human and the computer have to work in tandem the whole time.

So here is a brief outline of the reverse translation as I did it:

  1. Computer analysis of the text (I used the proprietary VDK, self-written filters for it, Russian stemmers and a lot of scripts, for example for hooking up diff). The alphabet translation dictionary is initially empty. The result of the analysis is a set of variants of “incorrect” Cyrillic text in which everything is in the nominative case and the infinitive, with a universal morphology.
  2. Manual analysis of these text variants (looking for anything readable). The goal is to extend the dictionary used in the computer analysis. Repeat from step 1. If the text has become less readable, roll back to the previous dictionary.

That is basically it; now in a bit more detail:
UPD (1)
  1. Computer analysis
    The VDK API allows text analysis and processing, up to full-text indexing. An open-source “analog” would be Apache Lucene (the API behind Solr), if it could do everything that the VDK can.
    I have long wanted to completely rewrite my work (scripts, filters and analyzers) without using the VDK. So far, unfortunately, I have only partially succeeded.
    Without getting distracted by the VDK API, I will still try to lay out my workflow and how everything works.
    • We generate a new dictionary for Translate (see the function above) and run it: we translate the text, which is then fed to the subsequent steps of the workflow.
    • "Indexing" of the text using self-written filters and pre-stemmers. As a result, we have the parsed text.
    • We work with the words. For this you can use, for example, accent or dialect filters (with custom rules). Here you can strip accents, etc. I will illustrate with German umlauts: the former German Chancellor can appear in a text as “Schröder” or as “Schroeder”. German spelling has also changed (old vs. new), so “Crème” and “Krem” or, for example, “Schmant” and “Schmand” are one and the same word.
      Naturally, in our case we have a “pseudo-Russian” accent dictionary. So here we perform the inverse operation, the one normally used for searching rather than indexing, i.e. we expand each word into several similar ones.
    • Next comes the stemmer. A stemmer reduces a word to a unique form, such as the root, the infinitive, the nominative case, etc. Smart stemmers, for example dictionary-based ones, can remove prefixes or suffixes as long as the meaning of the word is not lost (that is, the word does not turn into a completely different one).
      Naturally, a purely Russian stemmer is completely unsuitable here. But you can use a custom stemmer that uses grammatical filters of the Russian language (as far as syntax is concerned) and a new dictionary (initially empty).
    • All these actions are tied together by a script (the workflow), passing through other filters at intermediate stages, for example UniqueFilter.
    • At the end a diff is run, so that at each iteration you can concentrate only on the changed words, spot new roots, etc.
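
The "expand each word into several similar ones" step can be sketched roughly as follows. This is only an illustration under stated assumptions: the proc `expand_variants` and the rule list `variantRules` are hypothetical names, the rules are the German umlaut spellings from the example above, and the real work was done with VDK filters, not with this code.

```tcl
# Hypothetical equivalence rules, as pairs of interchangeable spellings:
# o-umlaut/oe, a-umlaut/ae, u-umlaut/ue (written as \u escapes so the
# script does not depend on the encoding of the source file).
set variantRules [list \u00f6 oe \u00e4 ae \u00fc ue]

# Return the word plus every spelling reachable by applying a rule in
# either direction. Simplification: a rule rewrites all occurrences in a
# word at once, so mixed spellings within one word are not generated.
proc expand_variants {rules word} {
    set variants [list $word]
    foreach {a b} $rules {
        foreach v $variants {
            foreach {from to} [list $a $b $b $a] {
                set alt [string map [list $from $to] $v]
                if {$alt ni $variants} { lappend variants $alt }
            }
        }
    }
    return $variants
}

puts [expand_variants $variantRules "Schr\u00f6der"]
```

For the "dead" language the rule list would hold the pseudo-Russian accent pairs instead, and each resulting variant would be passed on to the stemmer.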

  2. Manual analysis.
    • At each iteration, we look at the changes in the words (the diff result) and at several variants of the “translation” of the text (also as a diff); we look for any clues in the changes, for known or similar words.
    • We correct the filter dictionaries and individual rules, and occasionally the workflow itself.
    • Repeat the iteration: run the workflow again.
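
As a rough stand-in for the diff between two iterations, the following sketch lists only the words that changed. The proc name `changed_words` is hypothetical, and the sketch assumes that both passes produce the same number of whitespace-separated words, which fits the literal, word-for-word translation described above; the author used a real diff tool wired into the workflow.

```tcl
# Compare two translation attempts word by word and return a list of
# {old new} pairs for the positions where the word changed.
# Assumption: both texts contain the same number of words.
proc changed_words {oldText newText} {
    set changes {}
    foreach old [split $oldText] new [split $newText] {
        if {$old ne $new} { lappend changes [list $old $new] }
    }
    return $changes
}

puts [changed_words "abc def ghi" "abc dxf ghi"]
# prints: {def dxf}
```

If the list of changes contains more readable words than before, the new dictionary is kept; otherwise one rolls back to the previous one.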

UPD (2)
There were questions about what is missing for fully automatic decryption, in other words “why switch on your head at all” (I exaggerate), and why, as I see it, a person and a computer have to work as a pair here. I will answer briefly: consider just one situation that exists in many languages, namely that reading and writing often do not agree, i.e. words are not pronounced letter by letter. Take German again: the syllable EU is pronounced OY in German, so Germans pronounce the words EURO and Europa as Oyro and Oyropa. Now imagine how a German would write down the Russian word “moy” (“my”): I guarantee that it would be “meu” and not “moi” (if only because the German word for “new” is “neu”, pronounced “noy”). And this is only one such situation; even with Watson on the job, you cannot do without a human head!

The person who solved this riddle and translated the text completely, down to a matching hash, sent me the answer by e-mail. Later he described how he did it, and his approach is fundamentally different from the way I went about solving and translating.
Unfortunately, for many reasons (including time), he is not eager to join the Habr community (and I still have all my invites and would gladly send one :). But if we manage to write it up together, or if he allows me to publish the solution on his behalf, I will do so with great pleasure. His solution, or rather its concept, is very, very impressive, even though I unfortunately did not grasp all the subtleties because I have not seen it in full.
For reasons I do not understand, this modest person did not want his name mentioned in the article; unfortunately, I was not let in on who he really is or where he is from either. His style of communication suggests that he is still very young. Although ... In short, it is all “secrets of the Madrid court”. I can only say for certain that he is “Russian” (these days it is more correct to say Russian :).
...
Thanks again to everyone who participated.

Source: https://habr.com/ru/post/232471/

