
Greetings.
Today I will talk about the input-language recognition algorithms used in X Neural Switcher.
As you know, Punto Switcher relies on dictionaries of letter combinations that are impossible in a given language. If you have Punto installed, you can see them at %PROGRAMFILES%\Yandex\Punto Switcher\Data\triggers.dat. At least, that is what Google told me. These dictionaries are encrypted (lightly, but still).
Additionally, Punto uses custom exceptions. I cannot tell where those are stored: the program is closed-source.
Now back to xneur. At the time I switched to Linux (2005), there were exactly two programs claiming the ability to recognize the input language and switch the layout: sven (still alive) and xneur (half dead). I chose to finish xneur, then at version 0.0.3 (http://www.linux.org.ru/forum/talks/811959). By that point it had actually been abandoned by the original developers, with no activity for more than half a year.
At that time, xneur was downright wretched. For thrill-seekers, so to speak, version 0.0.3 can still be found; it was awful.
The main recognition idea back then was the "weight" of letter combinations. In short, a fairly large text was taken and used to count how often each two-letter combination occurs. Each combination's "weight" for a specific language was computed and written to a file.
Xneur broke a word into letter combinations and computed the word's "weight" for each layout. If the value exceeded a certain threshold (chosen by eye), the word was switched to that layout.
After trying what existed when I became interested in xneur, I realized it was no good and began to think about how to improve it all.
It took a couple of years of trial and error before the internal recognition algorithms settled down. I have no philological or linguistic education; all the algorithms were invented from logic and intuition.
My conscience did not allow me to use Punto's dictionaries, and besides, I intended xneur to be multilingual, so a Russian-English-only dictionary simply would not do.
Further I will describe the algorithms used by the program at the moment.
Algorithms
Xneur currently uses multi-level language detection: I simply ranked the algorithms by their degree of reliability. Many of you will probably recognize here the kind of ranking used in antivirus software.
The layout-detection function receives as input an array of strings, each of which is the typed word as it would appear in each of the layouts installed in the system.
At each stage, the program first checks whether the typed word can stay in the currently active layout. Only if the word does not fit the current layout are the remaining layouts checked.
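As an illustration only (the toy layout tables and function names below are my assumptions, not xneur's actual code, which is written in C), the current-layout-first policy can be sketched like this:

```python
# Toy key-to-letter tables standing in for real layouts (illustrative only).
LAYOUTS = {
    "en": dict(zip("qwer", "qwer")),   # identity mapping for the demo
    "ru": dict(zip("qwer", "йцук")),   # same physical keys on a ЙЦУКЕН layout
}

def render(keys, layout):
    """Render a sequence of physical keys as text in the given layout."""
    return "".join(LAYOUTS[layout][k] for k in keys)

def detect_layout(keys, current, check):
    """Apply one detection level: keep the current layout if the word fits it,
    and only otherwise try the remaining layouts."""
    candidates = {lay: render(keys, lay) for lay in LAYOUTS}
    if check(candidates[current], current):
        return current
    for lay, word in candidates.items():
        if lay != current and check(word, lay):
            return lay
    return current  # undecided at this level: leave the word as typed

# A stand-in dictionary check for the demo.
WORDS = {"en": {"qwer"}, "ru": {"привет"}}
def dict_check(word, layout):
    return word in WORDS[layout]
```

Here `check` stands for any of the levels described below (user dictionary, system dictionary, heuristics); each level is asked about the current layout before any other.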
The options below go in order of decreasing switching reliability.
User dictionary
At this stage the word is checked against dictionaries that I either predefined (for example, e-mail addresses, IP and MAC addresses switch to the English layout), or that the user added, or that the program itself added during self-learning.
Previously, user dictionaries and regular expressions (IP, MAC, etc.) lived in separate settings files, but they were later merged, which lets the user add rules more flexibly. For example, when adding the letter combination "ped", you can specify where it must occur in the word for the switch to fire: at the beginning ("pedal"), at the end ("moped"), or anywhere in the word ("wikipedia").
Moreover, highly advanced users can write their own regular expressions using Perl regex syntax. But as far as I can remember, nobody uses this feature. Maybe people simply do not know about it?
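A minimal sketch of such position-aware rules (the function name and rule encoding are mine, not xneur's; Python's `re` stands in here for the Perl-style regex engine):

```python
import re

def rule_matches(word, pattern, position):
    """Check one user rule against a word.

    position is one of: "start" (combination at the beginning of the word),
    "end" (at the end), "contains" (anywhere in the word), or "regex"
    (a user-supplied regular expression).
    """
    if position == "start":
        return word.startswith(pattern)
    if position == "end":
        return word.endswith(pattern)
    if position == "contains":
        return pattern in word
    if position == "regex":
        return re.search(pattern, word) is not None
    raise ValueError(f"unknown rule position: {position}")
```

A predefined MAC-address rule, for instance, would just be a "regex" entry such as `^([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$` bound to the English layout.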
System dictionary
If the word does not occur in any of the user's dictionaries, the program moves to the next level of verification: the spell-checker dictionaries. As they say, thanks to those who invented and implemented them.
Depending on the build options, xneur can use either the aspell library or (by default) the Enchant wrapper over spell-checking dictionaries.
The aspell library is a powerful but already somewhat outdated spell-checking library. If you use a packaged Linux distribution, the dictionary packages will be called aspell-ru, aspell-uk or aspell-en, depending on your language.
Dictionary support in xneur started with this library. Unfortunately, the number of dictionaries available for it is no longer growing, which is why I had to look for a replacement.
As a replacement, my gaze first fell on myspell and hunspell, and then on the magnificent Enchant wrapper over dictionaries. This wrapper is used by the spell checkers of AbiWord, LibreOffice and Firefox.
Enchant is not a dictionary library itself but a wrapper over various ones: it gathers different dictionary engines under one roof, combining ease of programming with speed. On Ubuntu, Enchant mostly uses Myspell dictionaries and falls back to other engines only for rare languages.
Therefore, xneur by default also uses Enchant.
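The real check goes through the aspell or Enchant C API; as a rough sketch of the decision logic only (plain word sets stand in for the spell checker here, and all names are my own):

```python
# Stand-in "dictionaries"; in the real program this is a spell-checker call
# (e.g. an Enchant or aspell lookup).
DICTS = {
    "en": {"hello", "world"},
    "ru": {"привет", "мир"},
}

def spell_ok(word, layout):
    """Would the spell checker accept this word for this layout's language?"""
    return word.lower() in DICTS[layout]

def spell_stage(candidates, current):
    """candidates maps layout -> the typed word rendered in that layout."""
    if spell_ok(candidates[current], current):
        return current      # the word fits the active layout: keep it
    for layout, word in candidates.items():
        if layout != current and spell_ok(word, layout):
            return layout   # another layout's dictionary knows it: switch
    return None             # undecided: fall through to heuristic analysis
```

Returning `None` is what hands the word over to the next, heuristic, level.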
Heuristic analysis
If the reliable sources of information, the dictionaries, fail to determine which layout a word belongs to, the least reliable but, frankly, most interesting step comes into play: heuristic analysis.
After much trial and error I arrived at the current version, but first I will describe the unsuitable intermediate options.
"Weight" of the letter combination. I already mentioned this option at the beginning of the article, but I will repeat it. In short, a fairly large text was taken and used to count how often each two-letter combination occurs. The combinations were stored in Latin characters. Each combination's "weight" for a specific language was computed and written to a file.
Xneur then broke a word into letter combinations and computed the word's "weight" for each layout. If the value exceeded a certain threshold (chosen by eye), the word was switched to that layout.
In the end this algorithm had to be abandoned right away: computing the weights was expensive, and the switching criterion was too flimsy. Still, I cannot shake the feeling that it could be polished into something sweet.
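For completeness, a toy version of that abandoned weighting scheme (the corpora, the frequency normalization and the idea of comparing per-language scores against a threshold are my illustrative assumptions):

```python
from collections import defaultdict

def train_bigram_weights(corpus):
    """Count how often each two-letter combination occurs in a training text."""
    counts = defaultdict(int)
    for i in range(len(corpus) - 1):
        counts[corpus[i:i + 2]] += 1
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()}

def word_weight(word, weights):
    """Sum of the word's bigram weights for one language."""
    return sum(weights.get(word[i:i + 2], 0.0) for i in range(len(word) - 1))
```

The layout whose trained weights give the word the highest score would win, provided the score cleared the hand-tuned threshold.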
Valid letter combinations. This one is simpler: a word belongs to a given layout if every letter combination in the word occurs in that language's dictionary of allowed combinations.
It did not fly: the check was slow and expensive in resources, and since the combinations were still stored in Latin characters, re-encoding added extra fun.
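The idea itself fits in a few lines (the allowed-bigram set below is a toy fragment, and the names are mine):

```python
def all_bigrams_allowed(word, allowed):
    """A word fits a language only if every two-letter combination of the
    word occurs in that language's list of allowed combinations."""
    return all(word[i:i + 2] in allowed
               for i in range(max(len(word) - 1, 0)))

# Toy fragment of an allowed-bigram list for English.
ALLOWED_EN = {"he", "el", "ll", "lo"}
```

The cost comes from the allowed lists being huge: almost every bigram occurs in real text, so the lookups buy little rejection power per check.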
And so we arrive at the modern variant of the heuristic: impossible letter combinations. By this point it had become clear that the combinations should be stored in the letters of the language being checked, to avoid the cost of converting between layouts.
Under the current algorithm, a word cannot belong to a given language if any letter combination of the word occurs in the list of impossible combinations for that language. We called these lists proto-languages, and the heuristics are built on top of them.
Proto-languages in xneur come in two kinds: big and small. Big ones are combinations of 3 characters, small ones of 2. Why the separation? Simple: the big proto list is 10-15 times smaller, which cuts the time needed to process a word. So a word is checked against the big proto first, and only if that does not identify it, against the small one.
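A sketch of the two-pass proto check (the proto lists here are toy placeholders, and the function names are mine):

```python
def contains_any(word, ngrams, n):
    """Does the word contain any n-gram from the given impossible-list?"""
    return any(word[i:i + n] in ngrams for i in range(len(word) - n + 1))

def possible_in_language(word, big_proto, small_proto):
    """A word cannot belong to the language if it contains an impossible
    combination; the big (3-char, much smaller) list is consulted first."""
    if contains_any(word, big_proto, 3):
        return False    # rejected by the big proto
    if contains_any(word, small_proto, 2):
        return False    # rejected by the small proto
    return True
```

A layout whose language rejects the word is dropped from consideration; if exactly one language accepts it, the word is switched there.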
The proto-languages themselves are generated by xneur: it has a built-in generation mechanism. Simply install the needed layout in the system, take a text in that language (preferably a large one), launch xneur with the appropriate key, and off you go!
Proto-languages are the last of the recognition algorithms. If a word has passed all the stages without hitting any obstacle, it stays in the layout it was typed in.
Summarizing
As I have shown, the xneur algorithms are monstrously simple.
Probably someone expected me to reveal terrible and complex machinery such as Markov chains, but no.
When developing xneur I followed the simple KISS rule: keep it simple, stupid! And as practice has shown, the false-positive rate of these algorithms is very low. True, at the moment the user has to install the system dictionaries, but that is a small price to pay for the processing speed.
If you have any thoughts or ideas on the topic, I will be happy to discuss them in the comments. After all, the power of open source is in the community!
Previous parts
X Neural Switcher - Cookbook (Part 0). Introduction, build and configuration
X Neural Switcher - Cookbook (Part 1). Forerunners and analogues

In further articles I will talk about:
- Setting up xneur as a keylogger.
- Glitches and quirks of xlib, locales and different DEs.