
Preprocessors and metalanguages in error correction programs

Computational linguistics is very conservative. A huge number of programs already solve very complex preprocessing problems for target languages, yet such programs are rarely used in spell checkers. Below, using the generally accepted "difficult" case of correcting "-тся" and "-ться", I will try to show how the "conservatism" of programmers leads to a specific class of errors.

A. Reformatsky, a very shrewd linguist, wrote that some mistakes result from inept grammar teaching at school. Alpatov, a venomous man, remarked, and I quote: “one can say that Russian grammar took as its basis the notions of Petersburg Germans about the Russian language”. Given the defects of school teaching and the psychological peculiarities of the compilers of grammars, these cases remain outcasts in computer spell checkers.

Apparently, these cases also owe their gloomy reputation to the fact that grammars strongly recommend “checking” the spelling with the questions “what to do” and “what does it do”, to the exclusion of other methods. Of course, if we proceed in the recommended way, no algorithmization and subsequent correction is possible. Only the “anagrammatic approach” remains, where the program outputs several variants of the “corrected” word. This is apparently the source of S. A. Krylov's attempt to divide such programs into purist and laxist ones.

The "obsession" of programmers who work in a team with professional linguists on sentence analysis is obviously connected with the linguists' lack of understanding of programming principles and with the "imposition" of linguistic ideas on programmers. The respected S. A. Krylov demonstrates exactly this; see his post on a well-known forum. That is a linguist's view, not a programmer's. For a programmer, other questions matter: can a grammar rule be algorithmized, or is algorithmization impossible, so that a “dictionary” approach must be used to verify the word?
Correcting the erroneous spelling of reflexive verbs is surprisingly easy in 40% of cases or more if you give up “what does it do, what to do” and understand the reflexive verb as it should be understood, by meaning: proper reflexive, reciprocal, objectless reflexive, general reflexive, and so on. The task of correcting a word then reduces to (a) preprocessing the phrase or word, and (b) creating a very simple metalanguage for describing the “rules”. This metalanguage will look like a classic stream editor, that is, a well-known class of programs.
So, let the "fuel" for the "preprocessor" be an array of the last seven (or fewer) letters of words ending in "-тся" and "-ться" (taken, for example, from Zaliznyak's dictionary). Taking more letters will obviously increase the "accuracy".
We place the resulting data in an array according to the “herringbone (fir tree) with its roots in the sky” principle. This optimizes and speeds up the search and also eliminates possible errors (see the code).

If someone dares to repeat my experiment with Zaliznyak's dictionary, the result will hardly be surprising: such an array contains only 3,548 endings (that is, the last seven or fewer letters of a word) whose spelling is unambiguous. The number of endings where both "-тся" and "-ться" are possible is just as small: only 407. Surprising, isn't it? Now it is enough to “run” the word being checked through the arrays, and we get rid of incorrect spellings of words like “seem”, “come”, etc., and of the notorious "anagrams". (The second array, where spelling variants are possible, requires the "metalanguage".)
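The split into the two arrays can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the experiment: the names EndingTables and build_tables are hypothetical, the "ending" is taken as up to four stem letters plus "-тся" or "-ться" (so seven letters in the "-тся" form), the word list is assumed to contain only correct spellings, and the source file is assumed to be saved as UTF-8.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the two arrays described in the text.
struct EndingTables {
    // {erroneous ending, correct ending}, longest erroneous ending first.
    std::vector<std::pair<std::u32string, std::u32string>> unambiguous;
    // Endings seen with both spellings, stored without the soft sign.
    std::vector<std::u32string> ambiguous;
};

static bool ends_with(const std::u32string& w, const std::u32string& s) {
    return w.size() >= s.size() &&
           w.compare(w.size() - s.size(), s.size(), s) == 0;
}

// Classify endings of correctly spelled dictionary words (assumption:
// the input list contains only correct spellings).
EndingTables build_tables(const std::vector<std::u32string>& words,
                          std::size_t context_len = 4) {
    const std::u32string hard = U"тся", soft = U"ться";
    enum { kHard = 1, kSoft = 2 };
    std::map<std::u32string, int> seen;  // stem context -> spellings observed

    for (const auto& w : words) {
        int flag = 0;
        std::size_t cut = 0;
        if (ends_with(w, soft))      { flag = kSoft; cut = soft.size(); }
        else if (ends_with(w, hard)) { flag = kHard; cut = hard.size(); }
        else continue;  // not a -тся / -ться word
        const std::u32string stem = w.substr(0, w.size() - cut);
        const std::size_t start =
            stem.size() > context_len ? stem.size() - context_len : 0;
        seen[stem.substr(start)] |= flag;
    }

    EndingTables t;
    for (const auto& entry : seen) {
        const std::u32string& ctx = entry.first;
        if (entry.second == (kHard | kSoft)) {
            t.ambiguous.push_back(ctx + hard);               // both spellings exist
        } else if (entry.second == kHard) {
            t.unambiguous.emplace_back(ctx + soft, ctx + hard);  // only -тся is correct
        } else {
            t.unambiguous.emplace_back(ctx + hard, ctx + soft);  // only -ться is correct
        }
    }
    // Longest erroneous ending first, so the most specific rule is tried first.
    std::stable_sort(t.unambiguous.begin(), t.unambiguous.end(),
        [](const std::pair<std::u32string, std::u32string>& a,
           const std::pair<std::u32string, std::u32string>& b) {
            return a.first.size() > b.first.size();
        });
    return t;
}
```

With the toy word list кажется, казаться, учится, учиться, this sketch yields two unambiguous rules and one ambiguous ending (учится). The counts for a real dictionary depend on the dictionary and on the context length chosen.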

This is what the “only one option is possible” array looks like (of course, these are just a few lines out of 3,548):

// "Herringbone in the sky"
In my data, the erroneous spelling comes before the delimiter and the correct one after it.
to look out :: look out
yachat :: yachting
yatatsya :: yata
go on
is here :: is found
go :: go
rtsya :: rtsya
Complete
get here

Again, the erroneous spelling comes before the delimiter and the correct one after it:

are yachting
is coming :: looking
ridya :: ridya
zeyutsya :: zeyutsya
see :: see
sound :: sound
to ring :: rut
oyutya :: oatsya
oetsya :: oetsya
come on

// Array for cases where the correct spelling cannot be determined from the ending alone (entries are listed in one form; obviously, the other variant is obtained by simply inserting a soft sign)

it is done
flakes
cramp
suck
is learning
merge
thinks
is about
stumble
sting
are
is eating
utsya
is
is it
is
atsya

For example, here is some simple code that searches the databases for matches:

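string correction_verbs(string str)
{
    // The rules are stored as a vector of pairs, in file order.
    vector<pair<string, string> > data;
    vector<pair<string, string> >::iterator it;

    // Helper classes from the author's own code base (not shown in this post).
    file_operations file_io;   // file reading
    string_utilities str_ut;   // string helpers, including replace_all

    string first_str, second_str, separator;

    // Path to the verb rules file, taken from the global configuration.
    string verbs = global::file_paths.find("verbs_cfg")->second;
    data = file_io.readf_vector_pair(verbs, separator);

    for (it = data.begin(); it != data.end(); ++it) {
        first_str = it->first;    // first element of the rule pair
        second_str = it->second;  // second element of the rule pair

        // If the text contains the second element, replace it with the
        // first one and stop at the first rule that matched.
        if (str.find(second_str) != string::npos) {
            str = str_ut.replace_all(str, second_str, first_str);
            break;
        }
    }

    data.clear();
    return str;
}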
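The function above depends on helper classes that are not shown in the post. As a rough, self-contained illustration of the same idea, the sketch below reads "erroneous :: correct" lines and replaces only the ending of a word. Everything here is an assumption made for the example: the names load_rules and correct_ending are hypothetical, the delimiter is taken to be "::", the text is assumed to be UTF-8, and the match is restricted to the end of the word rather than using find and replace_all.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Rule = std::pair<std::string, std::string>;  // {erroneous ending, correct ending}

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const auto b = s.find_first_not_of(" \t\r");
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(" \t\r");
    return s.substr(b, e - b + 1);
}

// Read "wrong :: right" lines; lines without the delimiter are skipped.
std::vector<Rule> load_rules(std::istream& in, const std::string& sep = "::") {
    std::vector<Rule> rules;
    std::string line;
    while (std::getline(in, line)) {
        const auto pos = line.find(sep);
        if (pos == std::string::npos) continue;
        Rule r(trim(line.substr(0, pos)), trim(line.substr(pos + sep.size())));
        if (!r.first.empty()) rules.push_back(r);
    }
    // Longest erroneous ending first, so the most specific rule wins.
    std::stable_sort(rules.begin(), rules.end(),
        [](const Rule& a, const Rule& b) { return a.first.size() > b.first.size(); });
    return rules;
}

// Apply the first rule whose erroneous ending matches the end of the word.
std::string correct_ending(const std::string& word, const std::vector<Rule>& rules) {
    for (const Rule& r : rules) {
        const std::string& wrong = r.first;
        if (word.size() >= wrong.size() &&
            word.compare(word.size() - wrong.size(), wrong.size(), wrong) == 0) {
            return word.substr(0, word.size() - wrong.size()) + r.second;
        }
    }
    return word;  // no rule matched: the word is returned unchanged
}
```

Comparing bytes at the end of the string is safe for UTF-8 text as long as each rule is a whole sequence of characters, because a rule always starts on a character boundary. The two sample rules used to try it out (жеться :: жется and кажеться :: кажется) are made up for illustration and are not taken from the arrays above.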


Despite this obvious solution, which is often used in programming (for example, in the analysis of artificial languages), the commercially available linguistic program Orfo (not the least successful one, perhaps even perfect in many ways) is unable to solve the task fully. It has no “preprocessor”; instead, it uses the same notorious anagrammatic “distance calculation” algorithms, which inevitably force Orfo to propose some crazy edits.

See for yourself: online.orfo.ru
We enter a phrase with an error: "we have to correct the error."
At the output we get, as expected, the following list of suggestions: dress up, have to, get used to, thread, get close.
Draw your own conclusions.
(Perhaps this is a worthy example of how linguists "bent" programmers and got an incomplete and illiterate solution at the output.)

Let's look at how the preprocessor above works. The variable str holds the word "have to". The array contains only reflexive verbs “with unambiguous spelling”. The program looks for a match by the last letters, starting from the top of the array, and at the output we get the word with its ending unambiguously replaced (see the database; there are no other options). If the program finds no match in the array, the function returns the word unchanged. Of course, the word should then be checked against the "-тся" / "-ться" database using a certain set of descriptions. But I will write about “metalanguages” in such linguistic programs in the next post, so that this one does not become cumbersome.
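The two-stage check described here might be sketched like this. It is only an illustration: Verdict, Outcome and preprocess are hypothetical names, the ambiguous endings are assumed to be stored without the soft sign, the strings are assumed to be UTF-8, and the "metalanguage" stage itself is not implemented, only signalled.

```cpp
#include <string>
#include <utility>
#include <vector>

enum class Verdict { Corrected, NeedsRules, Unchanged };

struct Outcome {
    Verdict verdict;
    std::string word;
};

static bool has_ending(const std::string& w, const std::string& e) {
    return w.size() >= e.size() &&
           w.compare(w.size() - e.size(), e.size(), e) == 0;
}

// Turn an ending written with "-тся" into its "-ться" counterpart.
static std::string with_soft_sign(const std::string& e) {
    const std::string hard = "тся", soft = "ться";
    if (!has_ending(e, hard)) return e;
    return e.substr(0, e.size() - hard.size()) + soft;
}

Outcome preprocess(const std::string& word,
                   const std::vector<std::pair<std::string, std::string>>& unambiguous,
                   const std::vector<std::string>& ambiguous) {
    // Stage 1: endings with a single possible spelling, scanned from the top.
    for (const auto& rule : unambiguous) {
        if (has_ending(word, rule.first)) {
            return {Verdict::Corrected,
                    word.substr(0, word.size() - rule.first.size()) + rule.second};
        }
    }
    // Stage 2: endings where both spellings exist; the word is left as is
    // and handed over to the rule descriptions (not implemented here).
    for (const auto& ending : ambiguous) {
        if (has_ending(word, ending) || has_ending(word, with_soft_sign(ending))) {
            return {Verdict::NeedsRules, word};
        }
    }
    return {Verdict::Unchanged, word};  // no match in either array
}
```

A caller would accept a Corrected result directly and pass a NeedsRules word on to the metalanguage stage together with its context.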

PS Of course, preprocessing has its drawbacks, but the program "thinks like a human being", and the output is a saner result.

Source: https://habr.com/ru/post/244905/

