
Ordinary (or unusual) Python transliteration

Once upon a time there was a need to write transliteration in Python - from Cyrillic to Latin: the word «ситх» ("Sith") becomes "sith", and «шелест» ("rustle") comes out as "shelest".

It would seem there is nothing much to write here - the task is hardly more difficult than printing "Hello world". And that is partly true - but not quite.

The thing is that some Russian letters are transliterated not into one Latin letter but into several at once: Ж, Ц, Ч, Ш, Щ, Ю and Я. Indeed, if the transliteration rules converted every letter into exactly one Latin letter, then transliterating Russian into English really would be barely more complicated than that very simple program.
But since we are not about to change the transliteration rules, let us see what happens when the usual transliteration is applied.

For example, the phrase «ШАПКА и Юля» ("HAT and Yulya") is converted to "SHAPKA i YUlya" or "ShAPKA i Yulya" - depending on what the transliteration table specifies for «Ш» and «Ю» (sometimes "SH" and "YU", sometimes "Sh" and "Yu").

That is, the standard transliteration functions do not take the case of the following letter into account: all uppercase letters are replaced by the same general rule. So in the course of transliteration it is easy to get something like "ChAShA" for «ЧАША» ("BOWL") or "SCHi" for «Щи» ("shchi"), when what we would actually like is "CHASHA" and "Schi".

Nevertheless, every implementation of Cyrillic-to-Latin transliteration in Python that I managed to find ignores this feature: both the numerous solutions posted on forums and the pytils library, which implements transliteration in one of its modules.

So, let us write our own transliteration function, with blackjack and ^W^W^W^H^H. :)


So, the first option is to walk through the string character by character. In that case it is always known which character comes next (unless it is the last character in the string).

The principle is as follows (a sketch in code follows the list):

  1. For each character, check whether it is among the keys of the lower_case_letters dictionary.
  2. If so, replace it with the value for that key in lower_case_letters.
  3. If not, check whether the character is among the keys of the capital_letters dictionary.
  4. If it is, check whether it is the last character in the string (if the length of the string is greater than the position of the current character + 1, it is not the last one, positions being counted from zero).
  5. If the character is not the last, check whether the next character is among the keys of lower_case_letters.
  6. If it is not, or if this is the last character in the string, replace the current character with the value for that key in capital_letters with the upper() method applied - that is, "Sh" becomes "SH", and so on.
  7. If the next character is among the keys of lower_case_letters, upper() is not applied.
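
Here is a minimal sketch of this first option; the dictionary names follow the description above, but their contents are abbreviated (a real table covers the whole alphabet):

    # Abbreviated tables; a full version covers the entire Russian alphabet.
    lower_case_letters = {
        'а': 'a', 'б': 'b', 'и': 'i', 'к': 'k', 'л': 'l', 'п': 'p',
        'ч': 'ch', 'ш': 'sh', 'щ': 'shch', 'ю': 'yu', 'я': 'ya',
    }
    capital_letters = {
        'А': 'A', 'Б': 'B', 'И': 'I', 'К': 'K', 'Л': 'L', 'П': 'P',
        'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch', 'Ю': 'Yu', 'Я': 'Ya',
    }

    def transliterate(text):
        result = []
        for i, ch in enumerate(text):
            if ch in lower_case_letters:
                result.append(lower_case_letters[ch])
            elif ch in capital_letters:
                latin = capital_letters[ch]
                # A lowercase Cyrillic letter next means a mixed-case word:
                # keep "Sh". Otherwise (all-caps word, or last character)
                # uppercase the whole replacement: "SH".
                if i + 1 < len(text) and text[i + 1] in lower_case_letters:
                    result.append(latin)
                else:
                    result.append(latin.upper())
            else:
                result.append(ch)
        return ''.join(result)

    print(transliterate('ШАПКА и Юля'))  # -> SHAPKA i Yulya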

The advantages of this option are that it is slightly shorter than the next four (by only a few lines, really) and, more interestingly, that it works very fast on very short strings. On long strings, on the contrary, it is very slow - that is, the time required to transliterate a string depends strongly on the number of characters in it; the graph shows exactly how the transliteration time changes for different string lengths.

For better accuracy, the execution time (on this graph and the next one) is given not for one transliteration operation but for 500. The processor in the test machine was an Intel Core 2 Quad Q9400 @ 2.66 GHz.
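
For reference, a measurement of that shape is easy to reproduce with timeit (the test string here is just an illustration):

    import timeit

    # 500 runs per measurement, as on the graphs
    text = 'ШАПКА и Юля ' * 10  # adjust the multiplier for other lengths
    print(timeit.timeit(lambda: transliterate(text), number=500))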

[Graph: transliteration time of the first option for different string lengths]

Well, it became obvious that faster processing of long strings requires somehow replacing each individual character across the entire string at once (replace). And, in fact, not just single characters but whole groups of characters can be replaced this way. Thus the second option appeared.

Generally speaking, replacing groups of characters was not the only solution that came to mind. At first, for example, there was the idea of splitting the string into words and checking the case of each word (if the word with all its characters converted to upper case equals the original word, then the whole word is in upper case), and, for words written in upper case, applying a separate transliteration table meant specifically for all-caps words. With this approach the letter «Ш» in the word «Шахматы» ("Chess") would be transliterated as "Sh", but in «ШАХМАТЫ» as "SH".

True, words written in mixed case (for example, «ШАХматЫ») would still be handled by the usual transliteration table (producing "ShAHmatY"). But that is, on the whole, not so important.

Or, alternatively, it could be made even more universal - use replace for words written entirely in lower case, and process the remaining words character by character (a hypothetical sketch follows). But that would probably be slower, since there would likely be many such words.
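
A hypothetical sketch of this per-word idea (it is not one of the five measured options; the tables and helper here are illustrative only):

    # Illustrative, abbreviated tables for the per-word approach.
    MIXED_TABLE = {'Ш': 'Sh', 'А': 'A', 'а': 'a', 'Х': 'H', 'х': 'h',
                   'М': 'M', 'м': 'm', 'Т': 'T', 'т': 't', 'Ы': 'Y', 'ы': 'y'}
    ALL_CAPS_TABLE = {'Ш': 'SH'}  # overrides for all-caps words

    def apply_table(word, table):
        return ''.join(table.get(ch, ch) for ch in word)

    def transliterate_by_words(text):
        out = []
        for word in text.split(' '):
            # Equal to its uppercased form (and not to its lowercased one,
            # which filters out digits and punctuation) means all-caps.
            if word == word.upper() and word != word.lower():
                out.append(apply_table(word, {**MIXED_TABLE, **ALL_CAPS_TABLE}))
            else:
                out.append(apply_table(word, MIXED_TABLE))
        return ' '.join(out)

    print(transliterate_by_words('ШАХМАТЫ Шахматы'))  # -> SHAHMATY Shahmaty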

Nevertheless, let us return to the second option, the one that actually got written. Its principle is to single out the Russian letters that are represented by several Latin characters into a separate dictionary, and then to build a dictionary (transliteration table) of character pairs, where each of these letters (Ж, Ц, Ч, Ш, Щ, Ю and Я) gets 33 variants - one for each lowercase letter of the Russian alphabet. Accordingly, it remains first to perform the replacement for these character pairs, and then for all the remaining characters that have not been converted yet.
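
A sketch of the second option, continuing with the tables assumed above; judging by the discussion of the fourth option below, each replacement in the original was performed with re.sub:

    import re

    multi_letter_capitals = {
        'Ж': 'Zh', 'Ц': 'Ts', 'Ч': 'Ch', 'Ш': 'Sh', 'Щ': 'Shch',
        'Ю': 'Yu', 'Я': 'Ya',
    }
    russian_lowercase = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'  # 33 letters

    # "Ша" -> "Sha", "Шб" -> "Shb", ...: the capital keeps mixed case
    # because it is followed by a lowercase letter. 7 * 33 = 231 pairs.
    pair_table = {}
    for cap, latin in multi_letter_capitals.items():
        for low in russian_lowercase:
            pair_table[cap + low] = latin + lower_case_letters.get(low, low)

    def transliterate_v2(text):
        for pair, repl in pair_table.items():
            text = re.sub(pair, repl, text)          # 231 pair replacements
        for cap, latin in multi_letter_capitals.items():
            text = re.sub(cap, latin.upper(), text)  # leftover capitals: all-caps
        for cyr, lat in {**lower_case_letters, **capital_letters}.items():
            text = re.sub(cyr, lat, text)            # everything else
        return text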

But it was precisely this option that turned out to be very slow (in fact, it was the implementation that turned out to be slow - more on that below).

More precisely, it is faster than the first option only on very long strings - around 10 thousand characters.

And when processing strings of 100 to 1000 characters, the first option is roughly 10 times faster.

Well, 7 * 33 = 231, that is, 231 additional replacements. Of course, many of these letter combinations never occur in real words - so, theoretically, fewer replacements would do. On the other hand, a text can consist of more than just dictionary words, and introducing such a restriction would not be desirable.

In fact, as it turned out, the problem is not that a large number of replacement operations are performed, but how exactly the replacement is implemented (see the fourth option).

However, first consider the third option.

The third option is somewhat similar to the second, but there are no longer 231 separate replacements for every possible combination of Ж, Ц, Ч, Ш, Щ, Ю and Я with the lowercase letters. Instead, a regular expression is used, and only 7 substitutions are needed. Each letter that corresponds to several Latin characters, when followed by a lowercase character ([а-я]), is replaced by the Latin representation of that letter followed by the same lowercase character (left untransliterated). After that, the remaining lowercase letters are replaced separately, as, incidentally, are the capitals. When replacing the capital letters that correspond to several Latin characters, the upper() method is applied to the Latin representation.
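
A sketch of the third option under the same assumed tables - 7 re.sub calls with a capture group instead of 231 pair replacements:

    import re

    def transliterate_v3(text):
        for cap, latin in multi_letter_capitals.items():
            # «Ша» -> "Sha": keep mixed case when a lowercase letter follows
            text = re.sub('%s([а-я])' % cap, latin + '\\1', text)
        for cap, latin in multi_letter_capitals.items():
            text = re.sub(cap, latin.upper(), text)  # the rest: all-caps words
        for cyr, lat in {**lower_case_letters, **capital_letters}.items():
            text = re.sub(cyr, lat, text)
        return text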

And this option turned out to be very fast indeed compared to the second.

On strings of 100 characters, it should be noted, it is slower than the first option. But on a 1000-character string it is already twice as fast as the first option, and 19.5 times faster than the second.

In principle, we could have stopped here, but in reality there is something else that can and should be added.

If we run import this, we get the well-known text of the Zen of Python. Among other things, it contains the line "There should be one - and preferably only one - obvious way to do it." However, anyone who has been programming in Python for a while knows that there are cases where there are quite a few ways to do the same thing, and while the result of any of them may be exactly the same, the time spent on the operation can differ enormously.

So, for example, u"Your name is %s" % username runs much faster than u"Your name is " + username. This is well and thoroughly covered in the article "Efficient String Concatenation in Python".

Similarly, a substring can be replaced with re.sub, or with the string's replace method. Moreover, if a particular replacement does not use regular expressions, the second method is strongly recommended - it also works much faster. Incidentally, pytils also uses the string replace method.
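
The difference is easy to see with a quick, admittedly informal measurement (the numbers will vary from machine to machine):

    import re
    import timeit

    text = 'ШАПКА и Юля ' * 100

    print(timeit.timeit(lambda: re.sub('Ш', 'Sh', text), number=500))
    print(timeit.timeit(lambda: text.replace('Ш', 'Sh'), number=500))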

So, the fourth option (the second version, edited). Only three lines were changed, but performance increased tremendously.
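
Sketched on the transliterate_v2 above, the edit amounts to turning every re.sub call in the three loops into a str.replace call:

    def transliterate_v4(text):
        for pair, repl in pair_table.items():
            text = text.replace(pair, repl)
        for cap, latin in multi_letter_capitals.items():
            text = text.replace(cap, latin.upper())
        for cyr, lat in {**lower_case_letters, **capital_letters}.items():
            text = text.replace(cyr, lat)
        return text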

This option beats all the others on strings of 1000 and 10000 characters. On strings of 100 characters, the difference from what it used to be (the second option) is also very large.

But what about editing the third option? After all, of the three loops where characters are replaced, regular expressions are used in only one. Fine then - let us edit the third option too.

The fifth option differs from the third only in that, in the two loops where regular expressions are not needed for the replacement, the replacement is performed with the string replace method.
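
Again sketched on the version above - re.sub stays only where the pattern actually needs a regular expression:

    def transliterate_v5(text):
        for cap, latin in multi_letter_capitals.items():
            text = re.sub('%s([а-я])' % cap, latin + '\\1', text)  # still a regex
        for cap, latin in multi_letter_capitals.items():
            text = text.replace(cap, latin.upper())                # now str.replace
        for cyr, lat in {**lower_case_letters, **capital_letters}.items():
            text = text.replace(cyr, lat)                          # now str.replace
        return text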

Without a doubt, this had a strong effect on performance: the fifth option turned out to be, on average, 1.5 times faster than the third. Moreover, on a 100-character string the fifth version even works faster than the first (that is, faster than character-by-character processing).

[Graph: transliteration time of all five options for different string lengths]

It turns out that the replacement-based algorithms are faster than character-by-character processing on strings of 100, 1000 and 10000 characters.

And yet each option has its advantages and disadvantages. For example, even though the fifth option processes a 100-character string faster than the character-by-character traversal, on very short strings (a few characters - and sometimes that is all there is) character-by-character processing will still be the fastest option.

Perhaps the most correct solution is a combination of the best algorithm for short strings and the best algorithm for long strings: first check the length of the string, and then, depending on it, use one algorithm or the other - as in the sketch below.
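
A sketch of that combined approach; the threshold of 100 characters is an assumption and should really be picked from one's own measurements:

    def transliterate_best(text, threshold=100):
        # threshold is a hypothetical cutoff, not a measured constant
        if len(text) < threshold:
            return transliterate(text)   # option 1: character by character
        return transliterate_v5(text)    # option 5: replace plus one regex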

In conclusion, I would like to quote Mike Haertel (the author of GNU grep), from his post to the freebsd-current mailing list explaining why GNU grep is so fast:

"The key to making programs fast is to make them do practically nothing. ;-)"

If you have any other thoughts or improvements - feel free to add them; I think it could be interesting to many.

Source: https://habr.com/ru/post/137089/

