Linguistic riddle. We translate from the "dead" language. [§2] Debriefing

This is a continuation of, or rather the answer to, the problem from the article "Linguistic riddle. We translate from a dead language."

I am catastrophically short of time, but as we know there is never enough of it, and since I promised, I have to write the article. Once again, I apologize for the delay.

Answer

For the impatient, here is the answer, which, by the way, at the time of writing had been fully unraveled by nobody except a single person (not from Habr). But more on that below...
The well-known phrase about the “glokaya kuzdra” (hello, AndreyDmitriev), quoted from Uspensky's book “A Word about Words”, was insidiously placed by me in the middle of the text. The rest, as already mentioned, was made up in the same vein, partly even in an “Old Russian” manner...

Do not utter the murder of the Glock upon the booster of your side, for the prosperous will tremble, and the new ones shall shine. Glokaya kuzdra shteko budlanula bokra i kurdyachit bokryonka. Yes, there will not be a cozdra side by the side, and a cobra side by the kuzdra will be in front of my face, for I’ve frightened me terribly, and I’m so budlanuto.

There were many attempts from different people, some heading in the right direction and some not so much. Your humble servant found it very interesting to follow the “guessing” process itself. Many thanks to all participants for that.
After my examples, several people discovered Tcl for themselves or decided to study it more seriously (a special hello to the “recruits”), which, as a developer, cannot but please me!
All in all, for my part it was a thoroughly positive experience...
It seems that if I had offered a job as a prize, interest in the task would have been much higher, but unfortunately we had no vacancies at the time, so it is what it is.

A simple example of machine translation (TCL in three lines)

Below are very simple Tcl sources with dictionaries for converting back and forth. They only demonstrate that the “machine” translation works (in three lines); as examples they are not very illustrative. To apply syllable and character substitutions that distinguish prefixes from suffixes, the dictionaries include spaces, periods and commas. Such a dictionary is rather cumbersome and inconvenient to parse, so further below I give a slightly more complicated algorithm with a dictionary that is easier to understand.
On Linux, Tcl is usually available out of the box (tclsh); on Windows (after installing Tcl) it is better to use wish instead of tclsh (because of the multiple encodings). If anyone wants the examples in Python, I will rewrite them and add them to the post.

Forward:

From Russian to "dead"

    # substitution dictionary (Russian -> "dead")
    set map1 { სტ  სცლგ  სცტ  შც  გილრ  ფაგ  რეოლ  პაგდლ  პეულ  პეალ  ზდ  პგლოე  ზგა  აგელეო  კელეო  კგლეგ  კეოლეგ  კეუგლ  კეალ  გტაგ  აგლ  მეუგლ  ედგეგ  ოდგ  ულდგ  ბლგ  ოლეოგ  ელეოგ  ეუგლ { } {ელ } , ელ, . ელ. { } {ეოგლ } , ეოგლ, . ეოგლ. { } {ოლ } , ოლ, . ოლ. { } {ილ } , ილ, . ილ. { } {აგ } , აგ, . აგ. { } {უგ } , უგ, . უგ. { } {ნოგ } , ნოგ, . ნოგ. { } {ხგ } , ხგ, . ხგ. { } {მეაგ } , მეაგ, . მეაგ.  ეა  შჩ  ეგ  ეუ  ეო  ო  პ  ვ  ხ  დ  ი  ზ  ჟ  ე  ე  კ  რ  ნ  მ  ა  ბ  ლ  ს  ტ  უ  ფ  გ  ც  ჩ  შ  _  ე  ი}
    # source text
    set decStr "       ,   ,   .        .      ,       ,    ,    ."
    # encode (in lower case)
    set encStr [string map $map1 [string tolower $decStr]]
    # print to stdout
    puts $encStr

Back:

From the "dead" to Russian

    # substitution dictionary ("dead" -> Russian)
    set map2 {სტ  სცლგ  სცტ  შც  გილრ  ფაგ  რეოლ  პაგდლ  პეულ  პეალ  ზდ  პგლოე  ზგა  აგელეო  კელეო  კგლეგ  კეოლეგ  კეუგლ  კეალ  გტაგ  აგლ  მეუგლ  ედგეგ  ოდგ  ულდგ  ბლგ  ოლეოგ  ელეოგ  ეუგლ  {ელ } { } ელ, , ელ. . {ეოგლ } { } ეოგლ, , ეოგლ. . {ოლ } { } ოლ, , ოლ. . {ილ } { } ილ, , ილ. . {აგ } { } აგ, , აგ. . {უგ } { } უგ, , უგ. . {ნოგ } { } ნოგ, , ნოგ. . {ხგ } { } ხგ, , ხგ. . {მეაგ } { } მეაგ, , მეაგ. . ეა  შჩ  ეგ  ეუ  ეო  ო  პ  ვ  ხ  დ  ი  ზ  ჟ  ე  ე  კ  რ  ნ  მ  ა  ბ  ლ  ს  ტ  უ  ფ  გ  ც  ჩ  შ  _  ე  ი }
    # encoded text
    set encStr "მეაგ პგლოეზგასელ პაგდლრეოლმელეოგ გილრაგლახაგ მოლ კეუგლზდლიმეილ პეალკგლეგ სფაგიხაგ, ეპეალ ფაგსცლგიბიშჩულდგ პაგდლრეოლეუგლშჩეილ, დოლ ფაგშცელეოგეუგლტ პაგდლრეოლინეგილ. გილრაგელეო კეუგლზდლოლ სტიკეალ პაგდლრეოლმეუგლრეოლ პეალკგლეგ ელ კეუგლლდეოჩედგეგ პეალკლეამკეოლეგ. დოლ მეაგ პაგდლიტ უგ პეალკგლეგ კეუგლზდლეგ, ოლ უგ კეუგლზდლეგ პეალკგლეგ ბლგიდ რეცინოგ ნაენოგ, ეპეალ ეოგლ სცლგოშმაგ პაგდლრეოლმეუგლრ, ელ ნმაეუგლ ტოხგ პაგდლრეოლმეუგლგტაგ."
    # decode
    set decStr [string map $map2 $encStr]
    # print to stdout
    puts $decStr
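For readers who prefer Python, the idea behind the two snippets above can be sketched roughly as follows. This is only an illustration under assumptions: `string_map` is a hand-written approximation of Tcl's `string map` (keys are tried in list order at each position, and replaced text is never rescanned), and the toy Latin dictionaries `FWD` and `BACK` are hypothetical, not the ones from the article.

```python
def string_map(pairs, text):
    """Approximate Tcl's [string map]: scan left to right, at each position
    use the first key (in list order) that matches, never rescan the output."""
    out = []
    i = 0
    while i < len(text):
        for key, val in pairs:
            if key and text.startswith(key, i):
                out.append(val)
                i += len(key)
                break
        else:
            # No key matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical toy dictionary: an "a" followed by a space, comma or period
# (i.e. a word-final "a") becomes "ur"; every "k" becomes "q".
FWD = [("a ", "ur "), ("a,", "ur,"), ("a.", "ur."), ("k", "q")]
# Reverse dictionary: the same pairs with keys and values swapped.
BACK = [(val, key) for key, val in FWD]

encoded = string_map(FWD, "Kasha, kaska.".lower())
decoded = string_map(BACK, encoded)
print(encoded)
print(decoded)
```

Because the first matching key wins, longer keys have to be listed before shorter ones that share a prefix with them, which is presumably why the article's dictionaries start with the multi-letter entries.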

The debriefing itself

I am not a linguist in the literal sense of the word, but first, I speak and write fluently in several languages from different language groups (some of them as well as my native one). Second, my work involves parsing and analyzing texts, including multilingual ones (search engines, recognition, indexing, etc.). Third, again partly because of my line of work, I understand quite well how some languages evolved and changed over time. Besides, I am a programmer, so transforming one thing into another, at least in a “machine” way, even in my head, comes naturally to me to some extent.

As an initial example, let's take some “transformations” that occurred with individual languages a long time ago:

kiks <-> Kekse <-> koekjes <-> cookies: Danish, German, Dutch and English, respectively;
ting <-> das Ding <-> het ding <-> the thing: the same languages;
каждого года <-> кожнага года <-> кожного року <-> každého roka (“of each year”): Russian, Belarusian, Ukrainian, Slovak.

What I would like to clarify here: unlike a linguist, I cannot say what exactly changed and in what order, but I have a purely practical notion of this morphology, the “mathematics of the equation”, so to speak.

I deliberately did not consider grammatical changes to sentence syntax during translation, i.e. everything in the text was “translated” literally. With a full modification, for example a permutation of words, the chances of recovering the Russian text from such a short passage would tend to zero. Compare, for example, even such close languages of one language group as Danish and German.

The literal translation (morphologically correct, but syntactically worthless):

[DK] du behøver ikke komme, hvis du har bedre ting for
[DE] Du brauchst nicht kommen, falls du hast bessere Dinge vor

Correct translation into German:

[DK] du behøver ikke komme, hvis du har bedre ting for
[DE] Du brauchst nicht zu kommen, falls du etwas Besseres vorhast

So, I repeat, the translation is done literally, without changing the word order, i.e. the characteristic syntax of Russian remains.

We define our language as slightly “growling” and “hooting” and make some letters in certain positions “hard to hear” or silent. We slightly complicate the spelling, so the words will be a little longer than in Russian. The result should resemble Russian with an admixture of Scandinavian (and possibly other Germanic) languages on the one hand, and something from the Turkic languages on the other.
So let's get started...

Translate to the "dead" language

In the original I translated straight into the characters of the Georgian alphabet, but people in the comments made it clear that one “hieroglyph” is hard to tell from another, so here we will translate in Cyrillic first and apply the Georgian alphabet only at the end.

To begin with, a small routine that lets you replace syllables and letters simply and tastefully, while “distinguishing” the beginning and end of words.

The magic routine [Translate]

    # Apply a chain of operations (-regexp or -map) to the text, in order:
    proc magic_text {args} {
        set text [lindex $args end]
        foreach {op val} [lrange $args 0 end-1] {
            switch -- $op {
                -regexp {
                    foreach {re sub} $val {
                        regsub -nocase -all $re $text $sub text
                    }
                }
                -map {
                    set text [string map -nocase $val $text]
                }
                default {
                    error "unknown operation '$op'"
                }
            }
        }
        return $text
    }
    # Wrap every word as ^word$, so that dictionary keys can anchor to the
    # start (^) and the end ($) of a word, apply the dictionary, then drop
    # the markers that are left:
    proc Translate {dictMap encText} {
        magic_text \
            -regexp {{(\m[^\s[:punct:]]+\M)} {^\1$}} \
            -map $dictMap \
            -map {^ "" $ ""} \
            $encText
    }

A trial run: using a simple dictionary, let us try to rewrite the phrase “Mom washed the frame” in the plural, with “Roma” in place of the “frame”.

    # dictionary (^ - start of word, $ - end of word):
    % Translate {^  $  $  $ "" ^ } "  ,    ."
      ,    .
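The marker trick used by Translate can be sketched in Python roughly as follows. Treat it as an approximation under stated assumptions: `string_map` is a hand-written stand-in for Tcl's `string map`, `\w+` only approximates the Tcl word pattern, matching is case-sensitive (the Tcl version uses `-nocase`), and the toy dictionaries `forward` and `backward` are hypothetical.

```python
import re


def string_map(pairs, text):
    """Stand-in for Tcl's [string map]: first matching key wins, no rescan."""
    out = []
    i = 0
    while i < len(text):
        for key, val in pairs:
            if key and text.startswith(key, i):
                out.append(val)
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def translate(pairs, text):
    # 1. Wrap every word in the markers ^ (start) and $ (end).
    marked = re.sub(r"\w+", lambda m: "^" + m.group(0) + "$", text)
    # 2. Apply the dictionary; keys may contain ^ or $ to anchor to word edges.
    mapped = string_map(pairs, marked)
    # 3. Remove the markers that no dictionary entry consumed.
    return mapped.replace("^", "").replace("$", "")


# Hypothetical toy dictionaries: word-initial "k" <-> "q", word-final "a" <-> "ur".
forward = [("^k", "q"), ("a$", "ur")]
backward = [("^q", "k"), ("ur$", "a")]

encoded = translate(forward, "kasha, kaska.")
print(encoded)
print(translate(backward, encoded))
```

Compared with the space-and-comma dictionaries shown earlier, each word-edge rule needs only one entry here, at the price of a source text that must not itself contain `^` or `$`.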

Now we proceed to creating the dictionary for our “dead” language.

    # dictionary
    set RuXy {                                                           $  $  $  $  $  $  $  $  $  }
    # 1 source text
    puts 1:[set ru "       ,   ,   .        .      ,       ,    ,    ."]
    # 2 translation:
    puts 2:[Translate $RuXy $ru]

Result of execution (translation):

    1:       ,   ,   .        .      ,       ,    ,    .
    2:       ,   ,   .        .      ,       ,    ,    .

By the way, if you adopt a few reading conventions (for example, "p" and "x" at the end of a word are almost never pronounced), you can read the whole sentence almost as in Russian right away.
The goal is achieved; now we simply overlay the Georgian alphabet, matching vowels to vowels and consonants to consonants (so as not to complicate things completely).

Below is a script that uses the Georgian alphabet, and translates back and forth:

    # Russian -> "dead" dictionary
    set RuXz { სტ  სცლგ  სცტ  შც  გილრ  ფაგ  რეოლ  პაგდლ  პეულ  პეალ  ზდ  პგლოე  ზგა  აგელეო  კელეო  კგლეგ  კეოლეგ  კეუგლ  კეალ  გტაგ  აგლ  მეუგლ  ედგეგ  ოდგ  ულდგ  ბლგ  ოლეოგ  ელეოგ  ეუგლ $ ელ $ ეოგლ $ ოლ $ ილ $ აგ $ უგ $ ნოგ $ ხგ $ მეაგ  ეა  შჩ  ეგ  ეუ  ეო  ო  პ  ვ  ხ  დ  ი  ზ  ჟ  ე  ე  კ  რ  ნ  მ  ა  ბ  ლ  ს  ტ  უ  ფ  გ  ც  ჩ  შ  _  ე  ი}
    # "dead" -> Russian dictionary
    set XzRu {სტ  სცლგ  სცტ  შც  გილრ  ფაგ  რეოლ  პაგდლ  პეულ  პეალ  ზდ  პგლოე  ზგა  აგელეო  კელეო  კგლეგ  კეოლეგ  კეუგლ  კეალ  გტაგ  აგლ  მეუგლ  ედგეგ  ოდგ  ულდგ  ბლგ  ოლეოგ  ელეოგ  ეუგლ  ელ$  ეოგლ$  ოლ$  ილ$  აგ$  უგ$  ნოგ$  ხგ$  მეაგ$  ეა  შჩ  ეგ  ეუ  ეო  ო  პ  ვ  ხ  დ  ი  ზ  ჟ  ე  ე  კ  რ  ნ  მ  ა  ბ  ლ  ს  ტ  უ  ფ  გ  ც  ჩ  შ  _  ე  ი }
    # 1 source text
    puts 1:[set ru "       ,   ,   .        .      ,       ,    ,    ."]
    # 2 translate into the "dead" language using $RuXz
    puts 2:[set xz [Translate $RuXz $ru]]
    # 3 translate back into Russian using $XzRu
    puts 3:[set ru2 [Translate $XzRu $xz]]

Well, the result of the execution (translation):

    1:       ,   ,   .        .      ,       ,    ,    .
    2:მეაგ პგლოეზგასელ პაგდლრეოლმელეოგ გილრაგლახაგ მოლ კეუგლზდლიმეილ პეალკგლეგ სფაგიხაგ, ეპეალ ფაგსცლგიბიშჩულდგ პაგდლრეოლეუგლშჩეილ, დოლ ფაგშცელეოგეუგლტ პაგდლრეოლინეგილ. გილრაგელეო კეუგლზდლოლ სტიკეალ პაგდლრეოლმეუგლრეოლ პეალკგლეგ ელ კეუგლლდეოჩედგეგ პეალკლეამკეოლეგ. დოლ მეაგ პაგდლიტ უგ პეალკგლეგ კეუგლზდლეგ, ოლ უგ კეუგლზდლეგ პეალკგლეგ ბლგიდ რეცინოგ ნაენოგ, ეპეალ ეოგლ სცლგოშმაგ პაგდლრეოლმეუგლრ, ელ ნმაეუგლ ტოხგ პაგდლრეოლმეუგლგტაგ.
    3:       ,   ,   .        .      ,       ,    ,    .
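The alphabet overlay on its own can be illustrated with a short Python sketch. It makes a simplifying assumption: the overlay is treated as a separate one-to-one letter substitution, whereas the article's script folds the Georgian letters directly into the dictionary. The Latin-to-Georgian table `LATIN_TO_GEORGIAN` is a hypothetical stand-in for the author's Cyrillic-to-Georgian pairs.

```python
# Hypothetical letter table: vowels map to vowels, consonants to consonants.
LATIN_TO_GEORGIAN = {
    "a": "ა", "e": "ე", "i": "ი", "o": "ო", "u": "უ",
    "b": "ბ", "g": "გ", "d": "დ", "z": "ზ", "k": "კ",
    "l": "ლ", "m": "მ", "n": "ნ", "r": "რ", "s": "ს", "t": "ტ",
}
GEORGIAN_VOWELS = set("აეიოუ")

TO_GEORGIAN = str.maketrans(LATIN_TO_GEORGIAN)
FROM_GEORGIAN = str.maketrans({geo: lat for lat, geo in LATIN_TO_GEORGIAN.items()})


def overlay(text):
    """Replace every known letter with its counterpart; keep everything else."""
    return text.translate(TO_GEORGIAN)


def restore(text):
    """Inverse of overlay(), valid because the table is one-to-one."""
    return text.translate(FROM_GEORGIAN)


sample = "bokra, kuzdra."
hidden = overlay(sample)
print(hidden)
print(restore(hidden))
```

A one-to-one table like this changes only the look of the text, not its statistics, so it adds visual difficulty rather than real cryptographic strength.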

Translation from the "dead" language

Before publishing the riddle article, I checked on myself whether a solution could be found. Of course, this is far from a clean test (after all, I know the result), but I approached the morphological analysis purely “scientifically”: I used some tools, first translated a completely different passage with my dictionary (a piece of text unknown to me, taken from a random position in another work), and in most cases worked only with the Georgian alphabet (which, by the way, is also foreign to me).

I must note that doing the translation either fully automatically or completely by hand seemed almost impossible to me. As I see it, the computer and I have to work constantly in tandem.

So, in brief, the scenario of the reverse translation as I did it:

1. Computer analysis of the text (I used the proprietary VDK, home-grown filters for it, Russian stemmers and a lot of scripts, for example for hooking up diff). The alphabet translation dictionary is initially empty. The result of the analysis is a set of variants of “incorrect” text in Cyrillic, in which everything is in the nominative case and the infinitive, with a universal morphology.
2. Manual analysis of these text variants (looking for something readable). The goal is to extend the dictionary used in the computer analysis. Repeat from step 1. If the readability of the text gets worse, roll back to the previous dictionary.

That's basically all; now in slightly more detail:
UPD (1)

Computer analysis
The VDK API supports text analysis and processing, up to full-text indexing; an open-source “analog” would be Apache Lucene (the API behind Solr), if it could do everything the VDK can.
I have long wanted to completely rewrite my tools, scripts, filters and analyzers without the VDK. So far, unfortunately, I have only partially succeeded.
Without getting sidetracked by the VDK API, I will still try to lay out my workflow and how everything works.
- We generate a new dictionary for Translate (see the function above) and run it: we translate the text, which is then fed to the subsequent steps of the workflow.
- “Indexing” of the text using home-grown filters and pre-stemmers. As a result, we have the text parsed.
- We work with words. For this you can use, for example, accent or dialect filters (with custom rules). Here you can strip stress marks, etc. I will illustrate with German umlauts: the former German Chancellor may appear in a text as “Schröder” or as “Schroeder”. German spelling has also changed (old and new), so “Crème” and “Krem” or, for example, “Schmant” and “Schmand” are one and the same word.
  Naturally, in our case we have a “pseudo-Russian” accent dictionary. So here we perform the inverse action, the one normally used for searching rather than indexing, i.e. we expand each word into several similar ones.
- Next we apply a stemmer. A stemmer reduces a word to a unique form, such as the root, the infinitive, the nominative case, etc. Smart stemmers, such as dictionary-based ones, can remove prefixes or suffixes as long as the meaning of the word is not lost (that is, the word does not turn into a completely different one).
  Naturally, a purely Russian stemmer is completely unsuitable here. But you can use a custom stemmer that uses the grammatical filters of the Russian language (as far as syntax is concerned) and a new dictionary (initially empty).
- All the steps are tied together by a script (the workflow), passing through other filters at intermediate stages, for example UniqueFilter.
- At the end diff is run, so that at each iteration you can concentrate only on the changes in words, spot new roots, etc.
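The “expand each word into several similar ones” step can be sketched in Python as below. This is not the VDK filter API, only a minimal illustration: the rule table `RULES` and the helpers `expand_variants` and `unique` are hypothetical, and `unique` merely assumes that something like the UniqueFilter mentioned above removes duplicates while keeping order.

```python
# Hypothetical spelling rules: each pair lists two interchangeable spellings.
RULES = [("ö", "oe"), ("ä", "ae"), ("ü", "ue"), ("ß", "ss")]


def expand_variants(word, rules=RULES):
    """Return the word together with the spellings reachable through the rules."""
    variants = {word}
    for a, b in rules:
        extra = set()
        for v in variants:
            extra.add(v.replace(a, b))  # e.g. "ö" -> "oe"
            extra.add(v.replace(b, a))  # and the other way round
        variants |= extra
    return variants


def unique(words):
    """Drop repeated words while keeping the order of first appearance."""
    return list(dict.fromkeys(words))


print(sorted(expand_variants("Schröder")))
print(unique(["kuzdra", "bokr", "kuzdra"]))
```

For the “dead” language the rules would come from the pseudo-Russian accent dictionary instead of German umlauts, and every variant would then be passed on to the stemmer.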
Manual analysis.
- At each iteration we look at the changes in words (the diff output) and at several variants of the “translation” of the text (also a diff); we look for any clues in the changes, for known or similar words.
- We correct the filter dictionaries and individual rules, and occasionally the workflow itself.
- Repeat the iteration: run the workflow again.
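The per-iteration diff can be approximated with Python's standard `difflib`. The sketch below assumes a plain word-level comparison of two candidate translations; the function name `changed_words` and the sample strings are hypothetical, and the author's actual diff scripts are not shown in the article.

```python
import difflib


def changed_words(previous, current):
    """List the word runs that differ between two candidate translations."""
    old, new = previous.split(), current.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep only replaced, inserted or deleted runs of words.
            changes.append((old[i1:i2], new[j1:j2]))
    return changes


before = "glokaya kuzdra shteko"
after = "glokaya kusdra shteko"
for removed, added in changed_words(before, after):
    print(removed, "->", added)
```

Looking only at the changed runs keeps attention on what the latest dictionary edit actually did; whether the text became more readable remains a human judgment, as described above.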

UPD (2)
There were questions about what is missing for fully automatic decryption, in other words “why switch on your head at all” (I exaggerate), and why, in my view, a human and a computer have to work as a pair here. I will answer briefly: consider just one situation that exists in many languages, namely that spelling and pronunciation often do not match letter for letter. Take German again: the syllable EU is pronounced OY in German, so the words EURO and Europa are pronounced by Germans as Oyro and Oyropa. Now imagine how a German would write the Russian word “moy” (“my”): I guarantee it would be “meu” and not “moi” (if only because the German word for “new” is “neu”, pronounced “noy”). And this is only one such situation. Even with Watson at hand, you cannot do without a head!

The person who solved this riddle and translated the text completely, down to a matching hash, sent me the answer by e-mail. Later he described how he did it, which is fundamentally different from how I approached the solution and translation.
Unfortunately, for many reasons (including lack of time), he is not eager to join the Habr community (and I still have all my invites, I would have sent one without a second thought :). But if it works out as a joint piece, or if he allows me to publish the solution on his behalf, I will do so with great pleasure. His solution, or rather its concept, is very, very impressive, even though I unfortunately did not grasp the subtleties because I have not seen it in full.
For reasons I do not understand, this modest person did not want his name mentioned in the article, and unfortunately I was not told who he really is or where he is from either. His style of communication suggests that he is still very young. Although... All in all, it remains one big “secret of the Madrid court”. I can only say unequivocally that he is “Russian” (these days it is more correct to say Russian-speaking :).
...
Thanks again to everyone who participated.

Source: https://habr.com/ru/post/232471/
