
NLP: spell check - an inside look (part 1)

Those who have read my previous publications know that I write rather rarely, but usually in series: I like to gather my thoughts on a topic and sort them out properly, without squeezing myself into the Procrustean bed of a single short article.

This time a new occasion to talk about text processing (natural language processing, that is) has come up: I am developing a spelling and grammar checker for a company. The result should be similar in functionality to what is built into MS Word, only better :) I cannot yet call myself an expert in this field, but I am trying to learn. In these notes I will try to describe where our project is heading and how each stage of text processing is organized. Perhaps in the comments I will hear something new and interesting; if the project benefits from that, great. At a minimum, I will get the material straight in my own head, which is not bad either.

On the shoulders of giants?

Clearly, it is hard to invent your own system without looking at existing solutions. However, there are not many giants to be seen around. There is MS Word, which we all know, and... who else? Commenters may correct me, but apart from the LanguageTool module for OpenOffice (we will talk about it later), nothing even comes to mind. These are piece goods. (Yes, I also remembered the Grammarian Pro X package for the Mac, but it does not change the picture much.) So there are few "fathers" to orient ourselves by. Spell checking, at least, is implemented in plenty of places, but with grammar checking things are close to a disaster.

Compilation vs. static analysis

In programming languages, two models of detecting errors in text are clearly visible. First, errors can be identified at the compilation stage, that is, when trying to combine the words of a language into meaningful structures permitted by the language's grammar. Second, one can perform static code analysis, that is, look in the program text for patterns associated with potentially dangerous constructions.

In theory the "compilation model", of course, looks very tempting: we try to "compile" the text; if there are errors in it, the analyzed fragment simply "does not stick together", and the system immediately understands why, just as a compiler of a programming language does. Unfortunately, at the moment there are no full-fledged "natural language compilers". This is exactly the direction I am digging into in my spare time, but I am not ready to push raw ideas into a commercial product. It is better to build a good state-of-the-art module and, along the way, understand how such things work these days.

If you open the spelling settings in MS Word, you will see that grammar checking works exactly on the principle of a static analyzer: there is a fixed set of checks, and the system runs the text through them one by one:



Strictly speaking, it is not quite correct to talk about "the spelling module in MS Word": in fact, the modules for different languages were written by different teams using different algorithms. Still, the general idea of running text through a sieve of checks seems to hold for every module.
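To make the static-analyzer analogy concrete, here is a minimal sketch in Python of such a "sieve of checks": a fixed list of pattern-based rules that the text is run through, with every match reported as a potential issue. The rule names and patterns are purely illustrative toy heuristics of my own, not the actual rules used by Word or by our module.

```python
import re
from dataclasses import dataclass

@dataclass
class Issue:
    rule: str      # name of the check that fired
    start: int     # character offsets of the suspicious fragment
    end: int
    message: str

# Each "check" is just a pattern plus an explanation. The set of rules is
# fixed in advance, exactly in the spirit of a static analyzer.
RULES = [
    ("repeated_word",
     re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
     "repeated word"),
    ("a_before_vowel",
     re.compile(r"\ba\s+[aeiou]\w*", re.IGNORECASE),
     "'a' before a word starting with a vowel letter (very rough heuristic)"),
    ("double_space",
     re.compile(r" {2,}"),
     "multiple consecutive spaces"),
]

def check(text: str) -> list[Issue]:
    """Run the text through every rule and collect all matches."""
    issues = []
    for name, pattern, message in RULES:
        for m in pattern.finditer(text):
            issues.append(Issue(name, m.start(), m.end(), message))
    return issues

if __name__ == "__main__":
    sample = "This is is a example.  It looks fine."
    for issue in check(sample):
        print(f"{issue.rule}: {issue.message} -> '{sample[issue.start:issue.end]}'")
```

Real checkers, of course, work on tokenized and tagged text rather than raw strings, but the overall pipeline is the same: a fixed list of checks applied one after another, each reporting its own kind of suspicious fragment.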

Looking under the streetlight

Now let's discuss an important question: where do the checks we have just talked about come from? Why does MS Word contain exactly the set of rules shown in the screenshot above? Incidentally, the help provides more detailed information for each type of analysis:



Only the lazy have not criticized the quality of Word's grammar checking. It is enough to study at least this well-known collection of materials to see that many people share your negative experience :) I think the shortcomings of all grammar modules stem from three main causes. First, the very principle of "static checking" implies incomplete coverage of errors: their name is legion, and one needs a remarkable tester's talent to feed into the system all the conceivable and inconceivable absurdities that can occur in sentences. Second, our technologies are not as good as we would like: a tester may be well aware of many errors yet have no way to program checks for them, and clearly not all errors are equally easy to catch. Third, it seems that errors are sought in accordance with the well-known joke: under the streetlight, where it is light, and not where they actually occur.

To this day it is not so easy to find a frequency list of the errors encountered in ordinary correspondence. One of the few studies lists the twenty most common mistakes found in student essays (in English). These are native speakers, so the list may not look obvious to us; I suspect that for foreign learners the sample would be completely different (and, moreover, strongly dependent on the writer's native language).

The author of another article took the trouble to run texts containing these errors through various grammar-checking modules. The results were thoroughly disappointing. In short, everything is bad (and for some reason Word 97 turned out to be noticeably better than all subsequent versions, though that hardly matters to us). Most of the common errors are either too hard to program checks for, or were simply overlooked by the developers.

Our module as a mirror of the current state of the field

Of course, the customer wants the best grammar module in the world; at the very least, one no worse than MS Word's. We will try to provide it, but, to be honest, I do not expect to move far beyond the current quality standards: too much plays against us. There is no good classification of possible errors with their real frequencies (either for native speakers or for foreign learners), and without such a list any set of checks turns into firing at a dark sky in the hope of hitting something flying by. And the errors from the list mentioned above (I have studied them carefully) really are mostly hard to catch. Well, we will keep working. I have not yet mentioned that we are starting with English; German is planned next, and after that life will show.

In the following parts I will turn to the technical details of how our system works (for now it exists mostly in my head, but the work is moving fast), and for today I propose to stop here.

Source: https://habr.com/ru/post/108831/
