Reversible Cyrillic transliteration

Some readers may still remember writing SMS messages, and sometimes letters, in "translit". But why bother with transliteration today, when Unicode is everywhere? Unfortunately, legacy applications go out of service much more slowly than we would like. For example, tomographs are still in use that do not accept Cyrillic in patient names, even though the information system used by the same department handles Cyrillic perfectly well. The tomograph operator has to not only call the patient in for the examination but also write the patient's surname correctly in various documents. Similar situations can arise elsewhere.

So the task is to pass text to the legacy system in such a way that:

the operator of the legacy system can read the received text "by sound"
if necessary, the original Cyrillic text can be restored unambiguously

To keep things interesting, let's add more detailed requirements concerning compatibility and ease of use for a person:

use only letters in the narrow sense, with no punctuation or diacritics (this also preserves letter case)
convert each source letter independently of the others (no complications such as "at the beginning / end of a word")
keep replacements as short as possible, ideally a single letter
keep the rules of the inverse transformation as simple as possible; for example, the replacements should satisfy the Fano condition
keep replacements close in sound as an "ordinary person" perceives them; in practice this is a mixture of Latin, English, French, German and, occasionally, Spanish phonetics

Of course, apart from the first two, the items listed are heuristics rather than strict requirements.

Many ready-made schemes for converting Cyrillic to Latin can be found, but none of them satisfies all the requirements to an acceptable degree. Some use characters with diacritics as standard, some drop letters (usually a hard or soft sign), some offer irreversible replacements ("Щ" -> "SHCH") or phonetically wild ones (a Cyrillic sibilant written as "W"), and others have further fatal flaws.

So we reinvent the wheel. All we need is a correspondence table and a description of the conversion algorithm in both directions.

Table

Let's start with all the obvious one-letter replacements:

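А -> A    Б -> B    В -> V    Г -> G    Д -> D    Е -> E    З -> Z
И -> I    К -> K    Л -> L    М -> M    Н -> N    О -> O    П -> P
Р -> R    С -> S    Т -> T    У -> U    Ф -> F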

Recalling the requirement to keep replacements as short as possible, and since "С" is written as "S", we can with a clear conscience use the character "C" for "Ц".

For the remaining letters, tradition (and the plain shortage of Latin characters) tells us to use two-letter combinations. To keep the direct and especially the inverse transformation simple, it is best if the combinations are formed with a special character that is never used outside them. By the Fano condition such a special character should stand at the beginning of the combination, but tradition is too strong, so we write the letter "H" at the end. However, if "H" is never used on its own and the conversion algorithm is allowed to "go back" to (in fact, to remember) the previous input character, then a "mirrored" analogue of the Fano condition holds for these postfix combinations. That is, the algorithm can still identify them uniquely.
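To make the Fano condition concrete, here is a minimal sketch of a prefix-freeness check. The class name FanoCheck and the "HS" / "HK" spellings are illustrative assumptions only; they are not part of the scheme built in this article.

```java
import java.util.List;

final class FanoCheck {
    /** Returns true if no code in the list is a proper prefix of another code. */
    static boolean isPrefixFree(List<String> codes) {
        for (String a : codes) {
            for (String b : codes) {
                if (!a.equals(b) && b.startsWith(a)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "S" is a prefix of "SH", so postfix-style digraphs violate the condition.
        System.out.println(isPrefixFree(List.of("S", "SH", "K", "KH"))); // prints false
        // A hypothetical prefix-style spelling satisfies it, as long as "H" never stands alone.
        System.out.println(isPrefixFree(List.of("S", "HS", "K", "HK"))); // prints true
    }
}
```

The first call shows why the traditional "H at the end" spelling needs the look-ahead described above: the decoder cannot decide what "S" means until it has seen the next character.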

The special character for consonants is the same everywhere: "H". For vowels there are two options, "Y" and "J". Although "Y" is more familiar, it is also often used on its own, for "Й" or for "Ы". "J", in contrast, is perceived as a purely auxiliary character.

Decided: we use "J" for vowels. And the freed-up "Y" goes to "Й".

Since "J" is now a special character, it cannot be used for "Ж", and only "ZH" remains. Similarly, "H" cannot be used for "Х", and only "KH" remains.

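Now we can write down the common and the chosen combinations and single characters:

А -> A    Б -> B    В -> V    Г -> G    Д -> D    Е -> E    Ж -> ZH
З -> Z    И -> I    Й -> Y    К -> K    Л -> L    М -> M    Н -> N
О -> O    П -> P    Р -> R    С -> S    Т -> T    У -> U    Ф -> F
Х -> KH   Ц -> C    Ч -> CH   Ш -> SH   Э -> EH   Ю -> JU   Я -> JA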

This is where the common and good (in terms of our requirements) replacements end, and we step onto the shaky ground of improvisation, analogies and compromises.

Let's start with "Ы". "Y" is already taken (remember reversibility), and phonetically it is a poor substitute anyway. Look at the solution for "Э", "EH" (taken, by the way, from ISO/R 9:1968). By analogy, "Ы" should be replaced by "IH". Strangely, I have not come across this option anywhere.

With "Ё" the situation is also odd. There is the obvious option "Ë", which does not suit us, and there is the phonetic option "JO". But in the Russian alphabet "Ё" is, not by accident, derived from "Е" rather than "О". "Ё" often alternates with "Е", for example "клён" and "кленовый", and never alternates with "О". This gives one more heuristic: the "alphabetic" (neither phonetic nor graphic) closeness of letters. As a result, for "Ё" we construct the replacement "JE".

Let's pause:

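А -> A    Б -> B    В -> V    Г -> G    Д -> D    Е -> E    Ё -> JE
Ж -> ZH   З -> Z    И -> I    Й -> Y    К -> K    Л -> L    М -> M
Н -> N    О -> O    П -> P    Р -> R    С -> S    Т -> T    У -> U
Ф -> F    Х -> KH   Ц -> C    Ч -> CH   Ш -> SH   Ы -> IH   Э -> EH
Ю -> JU   Я -> JA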

How nice it would be to stop here and declare the problem solved to a first approximation. But three letters remain that we cannot do without, and for them there are no adequate options. The signs are usually replaced with apostrophes, and letter substitutes for them are either simply arbitrary or "witty", such as "Q". For "Щ", a replacement without diacritics is usually 3 to 4 characters long, and it still causes problems.

After long searching and suffering, I had to settle on this reasoning: for letters that do not correspond to sounds, we must not use letters that do have sounds. That leaves only the "special" characters used to form combinations. But by the Fano condition they cannot be used on their own, or the combinations become ambiguous.

The way out is to combine the special characters with each other. This will apparently complicate the conversion algorithm a little more, but it seems possible to keep the decoding unique.

For the hard sign (which is only a separator), the replacement "HH" seems intuitively appropriate: it is not pronounced and reads like a pause, a separation.

For the soft sign, the chain of associations ("J" -> iotated vowels -> softening of the preceding consonant) + ("H" -> separation) leads to the replacement "JH".

It cannot be called a beautiful solution, but among rotten apples there is little to choose from.

Unfortunately, this choice rules out the replacement "Щ" -> "SHH". The sequence "SHH" would mean "СЪ", and this combination occurs in Russian (for example, "съезд"). Again there are no nice solutions, and we have to look for one that is at least somehow motivated. The sound of "Щ" is close to a softened "Ш", and by analogy with the soft sign this can be expressed with the prefix "J". I realize that I am now citing myself as precedent, and that the code is still three characters long and non-standard. But, as the saying goes, "I have no other writers for you."
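The clash between "SHH" and the hard sign can be checked mechanically. The sketch below counts in how many ways a string splits into codes from a given set; the class name ParseCount and the reduced code sets are assumptions chosen only to illustrate the argument.

```java
import java.util.Set;

final class ParseCount {
    /** Counts the ways s can be split into a sequence of codes from the given set. */
    static int countParses(String s, Set<String> codes) {
        int[] ways = new int[s.length() + 1];
        ways[0] = 1; // the empty prefix has exactly one split
        for (int end = 1; end <= s.length(); end++) {
            for (String code : codes) {
                int start = end - code.length();
                if (start >= 0 && s.startsWith(code, start)) {
                    ways[end] += ways[start];
                }
            }
        }
        return ways[s.length()];
    }

    public static void main(String[] args) {
        // With "SHH" as a code of its own, the string splits two ways: S+HH and SHH.
        System.out.println(countParses("SHH", Set.of("S", "SH", "HH", "SHH"))); // prints 2
        // With "JSH" instead, only S+HH remains.
        System.out.println(countParses("SHH", Set.of("S", "SH", "HH", "JSH"))); // prints 1
    }
}
```

A count greater than 1 means the text cannot be restored unambiguously, which is exactly why "SHH" had to be given up once "HH" was taken.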

As a result:

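А -> A    Б -> B    В -> V    Г -> G    Д -> D    Е -> E    Ё -> JE
Ж -> ZH   З -> Z    И -> I    Й -> Y    К -> K    Л -> L    М -> M
Н -> N    О -> O    П -> P    Р -> R    С -> S    Т -> T    У -> U
Ф -> F    Х -> KH   Ц -> C    Ч -> CH   Ш -> SH   Щ -> JSH  Ъ -> HH
Ы -> IH   Ь -> JH   Э -> EH   Ю -> JU   Я -> JA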

Algorithm

Converting from Cyrillic to Latin is trivial. For brevity we ignore letter case.

Java code

public class Translit {

    /** Transliterates one upper-case Cyrillic letter; any other character is returned unchanged. */
    public static String cyr2lat(char ch) {
        switch (ch) {
            case 'А': return "A";   case 'Б': return "B";   case 'В': return "V";
            case 'Г': return "G";   case 'Д': return "D";   case 'Е': return "E";
            case 'Ё': return "JE";  case 'Ж': return "ZH";  case 'З': return "Z";
            case 'И': return "I";   case 'Й': return "Y";   case 'К': return "K";
            case 'Л': return "L";   case 'М': return "M";   case 'Н': return "N";
            case 'О': return "O";   case 'П': return "P";   case 'Р': return "R";
            case 'С': return "S";   case 'Т': return "T";   case 'У': return "U";
            case 'Ф': return "F";   case 'Х': return "KH";  case 'Ц': return "C";
            case 'Ч': return "CH";  case 'Ш': return "SH";  case 'Щ': return "JSH";
            case 'Ъ': return "HH";  case 'Ы': return "IH";  case 'Ь': return "JH";
            case 'Э': return "EH";  case 'Ю': return "JU";  case 'Я': return "JA";
            default: return String.valueOf(ch);
        }
    }

    /** Transliterates a whole string letter by letter. */
    public static String cyr2lat(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (char ch : s.toCharArray()) {
            sb.append(cyr2lat(ch));
        }
        return sb.toString();
    }
}

For example, the result of a couple of well-known pangrams:

Shirokaja ehlektrifikacija juzhnihkh guberniy dast mojshnihy tolchok podhhjemu seljhskogo khozjaystva.
Shheshjh zhe ejshje ehtikh mjagkikh francuzskikh bulok da vihpey chaju.

It does not look great, but the main purpose of this transliteration variant is, after all, full names:

Aleksandr Ivanovich Lebedjh
Georgiy Konstantinovich ZHukov

The inverse transformation is much more interesting, especially since it should be explainable to a (non-IT) person who will carry it out "in their head".
Apparently, we have to start with the special cases.

Since we read from left to right, the first thing to watch for is the character "J". It must be followed by one of five characters: "E", "H", "U", "A" or "S" (in the case of "S", one more "H" must follow), and the result is looked up in the table among the two- and three-letter combinations.
If there is no "J", we check whether the character is followed by the letter "H". This is the point that demands the most attention: the third character must not be another "H" (that would be the code "HH"). In other words, three characters in a row have to be examined. This is where violating the Fano condition comes back to bite us (at least it happens only once).
If neither "J" nor a single "H" is found next to the character, we simply look it up in the table as a separate letter.

Practice shows that after a little training people are able to perform the inverse transformation by hand. Of course, there is no need to make them do it unless necessary. It can also be automated (again, for simplicity, only for a string and only in upper case):

Java code

    public static String lat2cyr(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) { // walk the string character by character; some codes span several characters
            char ch = s.charAt(i);
            if (ch == 'J') { // prefix
                i++; // move to the character after the prefix
                if (i >= s.length()) {
                    throw new IllegalArgumentException("Unexpected end of input after 'J' at position " + (i - 1));
                }
                ch = s.charAt(i);
                switch (ch) {
                    case 'E': sb.append('Ё'); break;
                    case 'S':
                        i++; // the code is three characters long and must end with the postfix
                        if (i >= s.length() || s.charAt(i) != 'H') {
                            throw new IllegalArgumentException("Illegal transliterated symbol at position " + i);
                        }
                        sb.append('Щ');
                        break;
                    case 'H': sb.append('Ь'); break;
                    case 'U': sb.append('Ю'); break;
                    case 'A': sb.append('Я'); break;
                    default: throw new IllegalArgumentException("Illegal transliterated symbol at position " + i);
                }
            } else if (i + 1 < s.length() && s.charAt(i + 1) == 'H'
                    && !(i + 2 < s.length() && s.charAt(i + 2) == 'H')) {
                // a postfix follows, and it is not the start of the two-character code "HH"
                switch (ch) {
                    case 'Z': sb.append('Ж'); break;
                    case 'K': sb.append('Х'); break;
                    case 'C': sb.append('Ч'); break;
                    case 'S': sb.append('Ш'); break;
                    case 'E': sb.append('Э'); break;
                    case 'H': sb.append('Ъ'); break;
                    case 'I': sb.append('Ы'); break;
                    default: throw new IllegalArgumentException("Illegal transliterated symbol at position " + i);
                }
                i++; // skip the postfix
            } else { // single character
                switch (ch) {
                    case 'A': sb.append('А'); break;   case 'B': sb.append('Б'); break;
                    case 'V': sb.append('В'); break;   case 'G': sb.append('Г'); break;
                    case 'D': sb.append('Д'); break;   case 'E': sb.append('Е'); break;
                    case 'Z': sb.append('З'); break;   case 'I': sb.append('И'); break;
                    case 'Y': sb.append('Й'); break;   case 'K': sb.append('К'); break;
                    case 'L': sb.append('Л'); break;   case 'M': sb.append('М'); break;
                    case 'N': sb.append('Н'); break;   case 'O': sb.append('О'); break;
                    case 'P': sb.append('П'); break;   case 'R': sb.append('Р'); break;
                    case 'S': sb.append('С'); break;   case 'T': sb.append('Т'); break;
                    case 'U': sb.append('У'); break;   case 'F': sb.append('Ф'); break;
                    case 'C': sb.append('Ц'); break;
                    default: sb.append(ch);
                }
            }
            i++; // move on to the next code
        }
        return sb.toString();
    }

Summary

It would seem to be a simple and long-solved problem, and yet what room for creativity and discussion.

Seriously, though, the result is a working algorithm for reversible transliteration of all the letters of the Russian Cyrillic alphabet into Latin letters. Given how strict the requirements are, the result is acceptably readable. It can be used for integration with legacy systems and libraries, and for generating identifiers.

I hope someone finds the solution useful and the path to it entertaining.

Addendum

Based on the discussion in the comments: this needs to be shorter and more formal.
There are two non-negotiable requirements:

1. The result of transliteration must contain only letters of the basic Latin alphabet
(there are 26 of them)
abcdefghijklmnopqrstuvwxyz
2. Transliteration must be completely reversible
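Both requirements can be checked mechanically for any candidate set of codes. The sketch below is only an illustration: the class and method names are made up for this example, and pairwise distinct codes are a necessary, not a sufficient, condition for reversibility.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RequirementCheck {
    /** Requirement 1: every code consists of basic Latin letters only. */
    static boolean usesOnlyBasicLatin(List<String> codes) {
        for (String code : codes) {
            if (!code.matches("[A-Za-z]+")) {
                return false;
            }
        }
        return true;
    }

    /** Necessary for requirement 2: no two source letters share the same code. */
    static boolean codesAreDistinct(List<String> codes) {
        Set<String> seen = new HashSet<>();
        for (String code : codes) {
            if (!seen.add(code)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // An apostrophe used for a sign breaks requirement 1.
        System.out.println(usesOnlyBasicLatin(List.of("L", "'")));  // prints false
        // Two letters mapped to the same code cannot be told apart afterwards.
        System.out.println(codesAreDistinct(List.of("E", "E")));    // prints false
    }
}
```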

Standards exist. None of them meets the requirements.
We need to choose one standard and modify it minimally, only as far as needed to meet the requirements.
The modifications are determined uniquely by a chain of steps. If you do not like the result, please write at which step of the chain you disagree.

A bit of theory to justify the decisions.

Transliteration is the exact rendering of the characters of one script by the characters of another.
It should not be confused with phonetic transcription: conveying the sound is welcome but not guaranteed.
Graphic similarity of characters has the lowest priority. For example, rendering the letter "Х" as "X" is phonetically unacceptable.

Transliteration can be viewed as encoding the characters of the source alphabet with variable-length codes made of characters of the target alphabet. A code can be:

a single character
a prefix followed by a base character
a base character followed by a postfix
a base character with both a prefix and a postfix

Prefixes and postfixes can have different lengths; naturally, the shorter the better. And of course it is good to have as few different prefixes and postfixes as possible.
For Cyrillic and Latin it is quite possible to make do with one prefix and one postfix, each a single character.

For "easy reversibility" of the codes we introduce the following condition:

No code may start with a postfix or end with a prefix.

This is my generalization of the prefix code.
If this condition is met, then no fragment of the resulting sequence contains long "false codes". Of course, a prefix or postfix can still be cut off so that the remaining base character coincides with a single-character code.
This cannot be avoided and has to be kept in mind. But it never happens that a piece of a composite code, taken together with a neighboring single character, reads as an unintended composite code.
For example, suppose we use the codes "S", "SH" and "HH" (the last one violates the condition: it starts with a postfix). Then in the sequence "SHH" (the first code followed by the third) one can pick out the fragment "SH" (which matches the second code).
For prefixes, violating the "easy reversibility" condition is less unpleasant (the difference arises because parsing goes from left to right). But it also makes reading "by eye" harder: when skimming, we take in a word as a whole rather than sequentially, and the eye can "catch on" a random combination.
A particular consequence of the condition is that neither the prefix nor the postfix may be used as a single-character code.
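The "easy reversibility" condition is simple enough to state as code. The following is a minimal sketch with made-up names; the prefix and postfix characters are passed in as parameters, and the example calls assume "J" and "H" for them.

```java
import java.util.List;

final class EasyReversibility {
    /** True if no code starts with the postfix character or ends with the prefix character. */
    static boolean satisfiesCondition(List<String> codes, char prefix, char postfix) {
        for (String code : codes) {
            if (code.isEmpty()) {
                return false; // an empty code is never acceptable
            }
            if (code.charAt(0) == postfix) {
                return false; // starts with a postfix
            }
            if (code.charAt(code.length() - 1) == prefix) {
                return false; // ends with a prefix
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The example from the text: "HH" starts with the postfix.
        System.out.println(satisfiesCondition(List.of("S", "SH", "HH"), 'J', 'H')); // prints false
        // A prefix followed by postfixes is allowed.
        System.out.println(satisfiesCondition(List.of("S", "SH", "JH", "JHH"), 'J', 'H')); // prints true
        // The consequence: the prefix alone is not a valid code.
        System.out.println(satisfiesCondition(List.of("J"), 'J', 'H')); // prints false
    }
}
```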

For Cyrillic and Latin, the postfix is "H", with no alternatives.
The prefix is either "Y" or "J". If "Y" is used as the prefix, it cannot be used to render "Й" or "Ы". That is, for two letters (and "Й" is quite frequent) we would have to invent non-standard codes, far from the phonetics and most likely long.
With "J" there is no such problem: we did not want to use this character on its own anyway.

I. Choose the starting standard.
Phonetically the most correct is, of course, BGN. But BGN is fundamentally (even deliberately) irreversible.
In my opinion, the closest to basic Latin and to reversibility is "GOST 16876-71 / table 2", so we choose it.

...?

You may ask why not "GOST 7.79-2000 / system B", which is the one currently in force. Mainly because of "Х" -> "X" and "Й" -> "J". Besides, it is in force today, and tomorrow it may go the way of the previous GOST.

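А -> A    Б -> B    В -> V    Г -> G    Д -> D    Е -> E    Ё -> JO
Ж -> ZH   З -> Z    И -> I    Й -> JJ   К -> K    Л -> L    М -> M
Н -> N    О -> O    П -> P    Р -> R    С -> S    Т -> T    У -> U
Ф -> F    Х -> KH   Ц -> C    Ч -> CH   Ш -> SH   Щ -> SHH  Ъ -> ″
Ы -> Y    Ь -> ′    Э -> EH   Ю -> JU   Я -> JA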

The scheme is not ideal in terms of our requirements, so it has to be changed.
II. The first thing that catches the eye is "JJ". Why it is bad can be seen in the "theory" spoiler. We follow the rule "in any unclear situation, look at BGN". That is, "Й" -> "Y".
III. Now "Ы" is left without a code, and BGN does not help. There is a phonetic analogy between the pairs "И-Ы" and "Е-Э". To obtain the code for "Э", the standard adds a postfix to the code for "Е". We do the same: "Ы" -> "IH".
IV. The non-letter replacements for "Ъ" and "Ь" remain. In order not to wreck the phonetics, we can use only prefixes and postfixes.
V. Note that nothing (in the "theory" spoiler) forbids using the postfix as a base character in combination with the prefix, or vice versa. That gives us the codes "JH", "JHH" and "JJH".
VI. It remains to distribute this wealth. The most frequent sign, "Ь", gets the shortest code: "Ь" -> "JH".
VII. "Ъ" has no sound, and when reading, "H" is the easier one to leave unvoiced. So from the remaining codes we choose the one with more "H": "Ъ" -> "JHH".

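The result:

А -> A    Б -> B    В -> V    Г -> G    Д -> D    Е -> E    Ё -> JO
Ж -> ZH   З -> Z    И -> I    Й -> Y    К -> K    Л -> L    М -> M
Н -> N    О -> O    П -> P    Р -> R    С -> S    Т -> T    У -> U
Ф -> F    Х -> KH   Ц -> C    Ч -> CH   Ш -> SH   Щ -> SHH  Ъ -> JHH
Ы -> IH   Ь -> JH   Э -> EH   Ю -> JU   Я -> JA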

The code "SHH" is the only one that uses a postfix of length 2. But the total code length is only 3, and the letter is rare. Reversibility is not violated (not even "easy reversibility").
Nothing else in the standard needs to be touched.

Java code for illustration

package tools;

import static java.lang.Character.toUpperCase;

/**
 * Reversible transliteration of Russian Cyrillic.
 * Created by vladimir on 25.08.15.
 */
public class Translit {

    public static String lat2cyr(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) { // walk the string character by character; some codes span several characters
            char ch = s.charAt(i);
            boolean lc = Character.isLowerCase(ch); // the case of the first character decides the case of the letter
            ch = toUpperCase(ch);
            if (ch == 'J') { // prefix
                i++; // move to the character after the prefix
                if (i >= s.length()) {
                    throw new IllegalArgumentException("Unexpected end of input after prefix at position " + (i - 1));
                }
                ch = toUpperCase(s.charAt(i));
                switch (ch) {
                    case 'O': sb.append(ch('Ё', lc)); break;
                    case 'H':
                        if (i + 1 < s.length() && toUpperCase(s.charAt(i + 1)) == 'H') { // hard sign (JHH)
                            sb.append(ch('Ъ', lc));
                            i++; // skip the second postfix
                        } else {
                            sb.append(ch('Ь', lc));
                        }
                        break;
                    case 'U': sb.append(ch('Ю', lc)); break;
                    case 'A': sb.append(ch('Я', lc)); break;
                    default: throw new IllegalArgumentException("Illegal transliterated symbol '" + ch + "' at position " + i);
                }
            } else if (i + 1 < s.length() && toUpperCase(s.charAt(i + 1)) == 'H') { // a postfix follows
                switch (ch) {
                    case 'Z': sb.append(ch('Ж', lc)); break;
                    case 'K': sb.append(ch('Х', lc)); break;
                    case 'C': sb.append(ch('Ч', lc)); break;
                    case 'S':
                        if (i + 2 < s.length() && toUpperCase(s.charAt(i + 2)) == 'H') { // double postfix (SHH)
                            sb.append(ch('Щ', lc));
                            i++; // skip the first postfix
                        } else {
                            sb.append(ch('Ш', lc));
                        }
                        break;
                    case 'E': sb.append(ch('Э', lc)); break;
                    case 'I': sb.append(ch('Ы', lc)); break;
                    default: throw new IllegalArgumentException("Illegal transliterated symbol '" + ch + "' at position " + i);
                }
                i++; // skip the postfix
            } else { // single character
                switch (ch) {
                    case 'A': sb.append(ch('А', lc)); break;   case 'B': sb.append(ch('Б', lc)); break;
                    case 'V': sb.append(ch('В', lc)); break;   case 'G': sb.append(ch('Г', lc)); break;
                    case 'D': sb.append(ch('Д', lc)); break;   case 'E': sb.append(ch('Е', lc)); break;
                    case 'Z': sb.append(ch('З', lc)); break;   case 'I': sb.append(ch('И', lc)); break;
                    case 'Y': sb.append(ch('Й', lc)); break;   case 'K': sb.append(ch('К', lc)); break;
                    case 'L': sb.append(ch('Л', lc)); break;   case 'M': sb.append(ch('М', lc)); break;
                    case 'N': sb.append(ch('Н', lc)); break;   case 'O': sb.append(ch('О', lc)); break;
                    case 'P': sb.append(ch('П', lc)); break;   case 'R': sb.append(ch('Р', lc)); break;
                    case 'S': sb.append(ch('С', lc)); break;   case 'T': sb.append(ch('Т', lc)); break;
                    case 'U': sb.append(ch('У', lc)); break;   case 'F': sb.append(ch('Ф', lc)); break;
                    case 'C': sb.append(ch('Ц', lc)); break;
                    default: sb.append(ch(ch, lc));
                }
            }
            i++; // move on to the next code
        }
        return sb.toString();
    }

    public static String cyr2lat(char ch) {
        switch (ch) {
            case 'А': return "A";   case 'Б': return "B";   case 'В': return "V";
            case 'Г': return "G";   case 'Д': return "D";   case 'Е': return "E";
            case 'Ё': return "JO";  case 'Ж': return "ZH";  case 'З': return "Z";
            case 'И': return "I";   case 'Й': return "Y";   case 'К': return "K";
            case 'Л': return "L";   case 'М': return "M";   case 'Н': return "N";
            case 'О': return "O";   case 'П': return "P";   case 'Р': return "R";
            case 'С': return "S";   case 'Т': return "T";   case 'У': return "U";
            case 'Ф': return "F";   case 'Х': return "KH";  case 'Ц': return "C";
            case 'Ч': return "CH";  case 'Ш': return "SH";  case 'Щ': return "SHH";
            case 'Ъ': return "JHH"; case 'Ы': return "IH";  case 'Ь': return "JH";
            case 'Э': return "EH";  case 'Ю': return "JU";  case 'Я': return "JA";
            default: return String.valueOf(ch);
        }
    }

    public static String cyr2lat(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (char ch : s.toCharArray()) {
            char upCh = toUpperCase(ch);
            String lat = cyr2lat(upCh);
            if (ch != upCh) { // the source letter was lower case
                lat = lat.toLowerCase();
            }
            sb.append(lat);
        }
        return sb.toString();
    }

    /** Returns the character in lower case if requested, otherwise unchanged. */
    private static char ch(char ch, boolean toLowerCase) {
        return toLowerCase ? Character.toLowerCase(ch) : ch;
    }

    /** Round-trip demonstration. */
    public static void main(String[] args) {
        String s1 = cyr2lat("Александр Иванович Лебедь");
        String s2 = cyr2lat(" ");
        String s3 = cyr2lat("Широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства");
        String s4 = cyr2lat("Съешь же ещё этих мягких французских булок да выпей чаю.");
        String s5 = cyr2lat("                                ");
        System.out.println(s1);
        System.out.println(s2);
        System.out.println(s3);
        System.out.println(s4);
        System.out.println(s5);
        System.out.println();
        System.out.println(lat2cyr(s1));
        System.out.println(lat2cyr(s2));
        System.out.println(lat2cyr(s3));
        System.out.println(lat2cyr(s4));
        System.out.println(lat2cyr(s5));
    }
}

The code is provided only for experiments and as an illustrative description of the inverse transformation algorithm.
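A more compact, table-driven variant of the same final scheme might look like the sketch below. It is an assumption-laden illustration rather than the author's implementation: the class name TranslitTable is made up, only upper-case letters are handled, the decoder uses a greedy longest match (which is sufficient here because no code starts with "H"), and Cyrillic letters are written as Unicode escapes so that the source compiles under any file encoding.

```java
import java.util.HashMap;
import java.util.Map;

final class TranslitTable {
    // Codes in Russian alphabet order (A, BE, VE, ... YA), 33 entries, upper case only.
    private static final String[] CODES = {
        "A", "B", "V", "G", "D", "E", "JO", "ZH", "Z", "I", "Y", "K", "L", "M", "N", "O", "P",
        "R", "S", "T", "U", "F", "KH", "C", "CH", "SH", "SHH", "JHH", "IH", "JH", "EH", "JU", "JA"
    };
    private static final Map<Character, String> TO_LATIN = new HashMap<>();
    private static final Map<String, Character> TO_CYRILLIC = new HashMap<>();

    static {
        int index = 0;
        for (char c = '\u0410'; c <= '\u042F'; c++) { // U+0410..U+042F: capital A..YA
            put(c, CODES[index++]);
            if (c == '\u0415') {                       // IO (U+0401) follows IE (U+0415) in the alphabet
                put('\u0401', CODES[index++]);
            }
        }
    }

    private static void put(char cyrillic, String code) {
        TO_LATIN.put(cyrillic, code);
        TO_CYRILLIC.put(code, cyrillic);
    }

    /** Encodes upper-case Cyrillic letters; every other character is copied unchanged. */
    static String encode(String cyrillic) {
        StringBuilder sb = new StringBuilder(cyrillic.length() * 2);
        for (char ch : cyrillic.toCharArray()) {
            String code = TO_LATIN.get(ch);
            sb.append(code != null ? code : String.valueOf(ch));
        }
        return sb.toString();
    }

    /** Decodes by always taking the longest code (3, then 2, then 1 characters) at each position. */
    static String decode(String latin) {
        StringBuilder sb = new StringBuilder(latin.length());
        int i = 0;
        while (i < latin.length()) {
            int matched = 0;
            for (int len = Math.min(3, latin.length() - i); len >= 1; len--) {
                Character cyr = TO_CYRILLIC.get(latin.substring(i, i + len));
                if (cyr != null) {
                    sb.append(cyr.charValue());
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                sb.append(latin.charAt(i)); // not part of any code: copy unchanged
                matched = 1;
            }
            i += matched;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "\u0421\u042A\u0415\u0417\u0414"; // ES, HARD SIGN, IE, ZE, DE
        String latin = encode(word);
        System.out.println(latin);                      // prints SJHHEZD
        System.out.println(decode(latin).equals(word)); // prints true
    }
}
```

Note that Latin letters already present in the input pass through unchanged and would be misread on the way back, so this sketch is reversible only for text whose letters are all Cyrillic.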

For production transliteration needs there are suitable solutions (although no ready-made one satisfies requirements 1 and 2).

Among industry standards, transliteration is part of the Unicode Common Locale Data Repository (CLDR) project.
There is a very powerful implementation that includes CLDR: International Components for Unicode (ICU).
Specifically, the Java version of ICU: ICU4J.
It contains a framework for describing and performing transliteration (and much more).
For Russian Cyrillic there are ready-made implementations:
1. ISO 9. Reversible, but with diacritics.
2. BGN. Without diacritics, but with punctuation marks, and irreversible.
There is an "indefinite plan" to add GOST.
If I find the time and energy to figure it out, I will implement my scheme with ICU4J and publish it.

Thanks to the constructive criticism in the comments, the solution has changed and my understanding of the problem has deepened. I have started thinking about a "grown-up" implementation.
Thanks to everyone! Habr makes things better.

Source: https://habr.com/ru/post/265455/
