(Part 1)
Today we will talk about the levels at which our system understands text: which spelling errors are easy to catch, which are not so easy, and which are extremely difficult.
To begin with, a text can be viewed from two points of view: either as a simple sequence of words, spaces, and punctuation marks, or as a network of related concepts and syntactic-semantic dependencies. Say, in the sentence "I love big dogs" you can arrange the words in a different order, while the structure of the connections between the words stays the same:

Errors can also occur at different levels. You can make a mistake in the linear structure: type the same word twice, forget a period, a parenthesis, and the like. That is, the error occurs in the process of building words into a chain, and a person is unlikely to make it out of ignorance of some grammatical rule; most likely it is simple inattention. However, even at the level of chains some real grammatical errors can be identified. For example, an English text should not contain a combination like "has was" - it should be either "was" or "has been".
But still, real grammar begins at the level of analyzing the network of concepts. For example, in our example about dogs, the subject "I" must agree in person and number with the verb "love", regardless of the relative position of these words in the sentence. Of course, this consideration does not remove the need to catch the simpler errors identified at the level of the linear structure.
First levels
It is reasonable to start error checking with the simplest things. If something can be caught by a trivial search for substrings, why bring in the heavy artillery? The simplest piece of functionality is the AutoCorrect list, which also exists in MS Word. These are obvious typos that the system corrects automatically, without even asking the user for confirmation: abbout -> about, amde -> made, compleatly -> completely. AutoCorrect may seem such an obvious feature that there is no point in talking about it, but as an active MS Word user I am helped by it all the time, and I will be happy to see autocorrection in other word processors. That is what we work for, in the end.
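To make this concrete, here is a minimal sketch of how such a replacement list might work; the table entries and the whole-word matching are purely illustrative, not our actual implementation.

import re

# Illustrative replacement table in the spirit of MS Word's AutoCorrect list.
AUTOCORRECT = {
    "abbout": "about",
    "amde": "made",
    "compleatly": "completely",
}

def autocorrect(text):
    # Replace whole words only, so that a longer word merely containing a typo pattern is left alone.
    # (Restoring the original capitalization is left out for brevity.)
    def fix(match):
        word = match.group(0)
        return AUTOCORRECT.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", fix, text)

print(autocorrect("I was compleatly sure abbout it"))
# -> I was completely sure about it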
Slightly more complicated are simple rules based on regular expressions that treat the text as one big string. Today we use regular expressions to catch simple cases of typographical roughness: misplaced spaces around punctuation marks, extra spaces between words, non-standard combinations like "!?", and so on. In principle, even at the level of a simple search for substrings one can find a number of errors missed by the spell checker.
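For illustration, here is a rough sketch of this string-level checking; the patterns and messages are my own simplifications, not the real rule set.

import re

# Each check is a pattern over the raw string plus a human-readable message.
CHECKS = [
    (re.compile(r"\s+[,.;:!?]"), "space before a punctuation mark"),
    (re.compile(r"(?<=\S)  +(?=\S)"), "extra space between words"),
    (re.compile(r"!\?"), "non-standard combination \"!?\""),
]

def find_typographic_issues(text):
    for pattern, message in CHECKS:
        for match in pattern.finditer(text):
            yield match.start(), message

for pos, msg in find_typographic_issues("Wait , what?!  Or was it !? after all"):
    print(pos, msg)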
Sentences and tokens
Now we turn to more interesting things. In order to analyze text precisely as text, and not as a string of characters, we need to learn to identify its structural elements. We are still far from the structure of connections between words, so let's start with the simplest thing: recognizing sentence boundaries. Why do we need this? Well, firstly, there are errors that are characteristic of the beginning and the end of a sentence: forgetting to start with a capital letter, forgetting the final punctuation (a period, a question or exclamation mark). There are less obvious cases as well; for example, it is stylistically unacceptable to begin a sentence with a number written in digits. Secondly, without the sentence structure it is impossible to proceed to the next stage of the analysis: identifying the links between the words of a sentence.
Like everything else in our world, on closer inspection the task of breaking text into sentences no longer seems so simple. The obvious algorithm is to find a final punctuation mark followed by a capital letter (a sketch of it is given below). But then we will be wrong whenever we meet a person's name: "As M. Ivanov stated ..." Abbreviations ending with a period trip us up in the same way. In principle, you can add a list of names and abbreviations to the analyzer, but this solution will still have flaws. For example, it will have to be completely rewritten for any new language with its own rules for ending sentences. In addition, there is an obvious problem: we assume that the input text contains errors (correcting them is the whole point of our module), so how do we break erroneous input into sentences? With the naive approach we automatically lose the ability to see an error at the junction of two sentences. If we assume that the boundary lies between a period and a capital letter, then an error like "forgot the capital letter" is immediately out of reach, because the system simply does not understand that there is a sentence boundary in this place.
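Here is the sketch of that obvious algorithm, together with the case where it fails; the regular expression is illustrative, not our actual code.

import re

def naive_split(text):
    # Break right after . ! or ? when the next word starts with a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive_split("As M. Ivanov stated, the report is ready. We disagree."))
# Wrong: the initial "M." is mistaken for the end of a sentence:
# ['As M.', 'Ivanov stated, the report is ready.', 'We disagree.']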
Today the most popular approaches to splitting text into sentences in real-world conditions rely on machine learning classification algorithms.
Sentence splitter
In short, the task of classification is to determine the correct class C of an object based on its attributes (A1, ..., An). The learning algorithm is given a large sample of known objects as input, from which it forms its own idea of how attribute values affect an object's membership in a particular class. Then you can feed the algorithm the set of attributes of an unknown object and get its most likely class.
A textbook example of the classification task is determining the species of an iris flower based on four parameters - the length and width of the petal and of the sepal. There are three species of iris: Iris setosa, Iris virginica, and Iris versicolor. The Iris dataset lists 150 entries (50 per species) of the following form:
sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          Iris setosa
7.0           3.2          4.7           1.4          Iris versicolor
6.3           3.3          6.0           2.5          Iris virginica
(all measurements in centimeters)
A classification algorithm can study this data and build a model that maps the attributes of an iris to a particular species. If an iris unknown to me grows in my garden, I can measure its parameters and ask the algorithm what kind of iris it is.
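For concreteness, here is roughly what the iris exercise looks like with scikit-learn (assuming that library is available; our system does not necessarily use it).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()  # 150 flowers, 4 attributes each, 3 species

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy on held-out flowers:", model.score(X_test, y_test))

# An "unknown" flower from my garden: sepal length/width, petal length/width in cm.
unknown = [[5.1, 3.5, 1.4, 0.2]]
print("predicted species:", iris.target_names[model.predict(unknown)[0]])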
A classifier can also be used for other purposes: making decisions based on a table of reference decisions for given circumstances, or uncovering hidden patterns. Another textbook example studies the "attributes" of passengers who survived the Titanic disaster in order to understand what the chances of survival were for different groups of people (child or adult, male or female, passenger or crew member, holder of a 1st, 2nd, or 3rd class ticket).
There are a lot of different classification algorithms: decision trees, nearest neighbors, Bayesian networks, the maximum entropy model, logistic regression... Each has its own advantages and disadvantages. Some work better with numerical data, some are easier to program, some are faster, and some formulate the discovered classification rules in a way that is easy to analyze.
In relation to splitting text into sentences, the classifier works as follows. The input is a text that has been divided into sentences by hand. The system studies the "attributes" of each sentence end (essentially, what stands to the left and to the right of the boundary) and builds a classification model. After that we can point the algorithm at any position in the text and ask whether it is a sentence boundary.
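To make the idea of boundary "attributes" concrete, here is a toy sketch of what the features of a candidate boundary might look like; the specific feature names are my own, not Ratnaparkhi's.

def boundary_features(tokens, i):
    # Attributes of the gap after tokens[i], a candidate sentence boundary.
    prev_tok = tokens[i]
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_token": prev_tok.lower(),
        "next_token": next_tok.lower(),
        "prev_ends_with_period": prev_tok.endswith("."),
        "prev_is_single_letter": len(prev_tok.rstrip(".")) == 1,  # initials like "M."
        "next_is_capitalized": next_tok[:1].isupper(),
    }

tokens = "As M. Ivanov stated that the report is ready .".split()
print(boundary_features(tokens, 1))  # the gap after "M." - not a real boundary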
What are the subtleties here? First, we have a not quite standard formulation of the classification problem: the input contains only end-of-sentence contexts:
(A1, ..., An) -> (end of sentence)
In the classical formulation, the knowledge base should include all variants, i.e.:
(A1, ..., An) -> (end of sentence)
(A1, ..., An) -> (not the end of a sentence)
In our case, cramming all the examples of non-ends of sentences into the table is too costly - the knowledge base would grow enormously. Apparently for this reason the most frequently cited author of a machine learning scheme for this task, A. Ratnaparkhi, used the maximum entropy principle. This model lets you simply ask for the probability that an object belongs to a given class, without regard to other possible classes. In other words, we ask the model how likely it is that a given context is the end of a sentence. If the algorithm answers that the probability is higher than 1/2, we can mark the context as a sentence boundary.
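As an illustration of that step, here is a toy sketch using scikit-learn's LogisticRegression (which is a maximum entropy model for this binary case); the tiny hand-labelled training set is invented for the example and is nowhere near what a real corpus looks like.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labelled candidate boundaries: attributes -> is this the end of a sentence?
X = [
    {"prev_token": "ready.", "next_token": "we", "next_is_capitalized": True},
    {"prev_token": "m.", "next_token": "ivanov", "next_is_capitalized": True},
    {"prev_token": "dogs.", "next_token": "they", "next_is_capitalized": True},
    {"prev_token": "dr.", "next_token": "smith", "next_is_capitalized": True},
]
y = [1, 0, 1, 0]  # 1 = end of sentence, 0 = not

model = make_pipeline(DictVectorizer(), LogisticRegression()).fit(X, y)

candidate = {"prev_token": "done.", "next_token": "next", "next_is_capitalized": True}
p = model.predict_proba([candidate])[0][1]
print("P(end of sentence) =", p)  # mark a boundary if p > 0.5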
I think it makes sense to try other classification algorithms as well. As far as I know, this has not been done; if I get around to it, I will. Ratnaparkhi's own experiments show his algorithm's accuracy is around 98%, that is, out of one hundred sentence ends it guesses 98 correctly.
Unfortunately, in a spell checker we again run into the fact that the input text may contain errors. If you train the model on correct texts divided into sentences, the computer will rightly decide that a sentence always begins with a capital letter. If we drop capitalization from the attributes taken into account, the accuracy of the model will fall. One option is to introduce a few errors into the reference texts by hand ("forget" a period here, replace an upper-case letter there). Furthermore, in Ratnaparkhi's system we must first find a potential sentence boundary and only then ask the system for its opinion about that spot. He does this simply: look for a period, an exclamation mark, or a question mark, and ask what it is. In our case, the user may have forgotten the period altogether - and what then?
Today, in addition to punctuation marks, I also check newline characters (from personal experience, if I forget to put a full stop, it happens at the end of a paragraph). One could also try treating every space between words as a candidate boundary, but I am afraid accuracy would fall. In general, there is something to think about :)
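A tiny sketch of this extended candidate search (illustrative only):

import re

def candidate_boundaries(text):
    # Positions we will ask the classifier about: .!? plus newline characters.
    return [m.start() for m in re.finditer(r"[.!?\n]", text)]

print(candidate_boundaries("Forgot the full stop here\nSecond sentence. Third one!"))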
Okay, enough for today, we will continue next time.