Processing and classification of requests. Part Three: Correcting Typos

Misprints are sometimes useful for cheering the reader. Search engines appreciate humor while not able to, and the words typed with errors, lead them to confusion, which as a result distresses the user. To prevent these phenomena, there are automatic "correction" typos, they are spellcheckers.

More than enough has already been written about various approaches to correcting typos, so in this article I will not repeat what is already known, but I will show you how to write a spell checker from scratch - simple, but quite capable. All that is needed for this is a list of correct words and a bit of C ++.

')

I took the list of words here , but any other dictionary will do. We will store it in the prefix tree (it is Trie). Here is its somewhat exotic , not quite optimal , quite dangerous , but perhaps the most concise implementation:

struct trie_t: map<char, trie_t> { }

The trie_t structure inherits all map methods. The keys of this map trie_t letters of words, and the values are again trie_t structures that inherit ... and so on. Somewhat meditative, but for the prototype it will do.

Add a node weight to the structure, it will come in handy later:

 struct trie_t: map< char, trie_t > { size_t w; trie_t() : w(0) {} }

Write the words in the tree will be using this function:

 trie_t & add( trie_t & t, const string & s ) { t.w++; return !s.empty() ? add( t[ s[0] ], s.substr(1) ) : t['$']; }

It works like this: we take the first character of a word, write it into the key of the map, we get a link to the corresponding subtree (if necessary, it is created right there). Recursively write the rest of the string to this subtree. When the string is exhausted, we add to the last node a “stop sign” - the key ' $ ' with an empty subtree.

After reading the words from the specified file, we will add them all to the dictionary:

 trie_t glossary; ifstream fi( argv[1] ); string word; while( getline( fi, word ) ) add( glossary, word );

As a result, the number of words that have passed through this node will be recorded in the w field of each tree node.

What can you do with all this? You can print the contents of the tree by sorting the keys and subtrees of each node with the inherited iterator:

 void print( const trie_t & t, const string & prefix = "" ) { for( map<char, trie_t>::const_iterator it = t.begin(); it != t.end(); ++it ) if ( it->first == '$' ) cout << prefix << endl; else print( it->second, prefix + it->first ); }

You can determine the presence or absence of a given word in the tree:

 bool is_known( trie_t & t, const string & s ) { return !s.empty() ? is_known( t[ s[0] ], s.substr(1) ) : t.find('$') != t.end(); }

In this function, although it works correctly, for the sake of brevity, an error has been made. I will not correct it, since we will no longer need such a “clear” search.

Fuzzy search

If you imagine a tree as a certain repeatedly forking road, then the function of searching for a word is a traveler walking along this road. The traveler is given a route (search word), and in accordance with it he moves from node to node, crossing out the passed letters of the route. If the letters run out, it remains to check whether the “stop sign” is visible.

Now imagine that the traveler, as it were simultaneously sent in all directions at once. Each of its clones at each fork behaves in the same way, and so on, until someone comes to each of the terminations (that is, the leaves of the tree). The total number of travelers will be equal to the number of words recorded in the tree, and each of them will pass along its own individual trajectory.

Let all of them were originally given the same word route. At each transition, each traveler, as before, crosses out one letter in his copy of the route. But if the real direction of the transition does not coincide with the crossed out letter, the traveler receives a penalty. As a result, at the end of the journey, each person will have some debt accumulated. Line up all in a row, sorted in ascending order of this magnitude.

If the search word is present in the dictionary, then one traveler will go all the way without penalties - he will be the first in the sorted list. If there is no word in the dictionary, then the leader will be the one who has traveled with a minimum number of violations.

Actually, this is almost the entire algorithm. The trick is that the path traversed by the leader will be the closest correction of a given word (or the word itself if there are no typos), and the number of violations (deviations from a given route) is the number of typos.

On your marks

It is more correct, of course, to call the road a tree, forks - knots, travelers - hypotheses, and deviations from the route - editing operations. But let them remain as they are, somehow livelier. We introduce a structure storing all information about the traveler:

 struct rambler_t { string todo; string done; size_t cost; const trie_t * road; rambler_t( const string & t, const string & d, const size_t c, const trie_t * r ) : todo( t ), done( d ), cost( c ), road( r ) {} };

The fields store the following:

todo is the way to go. Initially, this is the travel route, that is, the word to be corrected. At the end there is an empty line.
done - the passable part of the path. At the beginning - an empty line, at the end - a list of passed nodes, that is, the corrected word.
cost - the amount of fines imposed. At the start - zero, at the finish - as it will.
road - the location of the traveler. In the beginning - the root of the tree, in the end - one of the leaves.

So, the traveler is at the starting point, you can go:

 rambler_t R( word, "", 0, &glossary );

First step

All possible directions can be iterated with the help of an iterator, which is already useful when printing a dictionary:

 for( map<char, trie_t>::const_iterator it = R.road->begin(); it != R.road->end(); ++it ) { char dest = it->first; //   const trie_t * road = it->second; //   //     }

After completing the step, the values of the structure fields will change as follows:

 R.done = R.done + dest; R.todo = R.todo.substr(1); R.road = R.road[ dest ];

There are two options with payment. If the chosen direction coincides with the “general line”, then the transition is free, otherwise +1:

 R.cost = R.cost + ( dest == todo[0] ? 0 : 1 );

Ok, the first step is done. Rather, a lot of first steps at once - instead of a lonely wanderer, a whole team appeared:

 typedef vector<rambler_t> team_t;

It remains only to repeat step by step, increasing the team until everyone reaches their destination point ... Stop. There is a nuance, and not one.

The above “turn is not there”, corresponding to the replacement of the letter of the word by another letter - only one of the varieties of typos. There are at least two more: deletion and insertion.

Deletion

This is the situation “paid, but did not go”: we delete the first letter from todo , self-fined, but do not make the transition and do not add anything to the done .

 R.done = R.done; R.todo = R.todo.substr(1); R.road = R.road; R.cost = R.cost + 1;

As a result, the letter from todo simply lost along the way, which is equivalent to its removal.

Insert

Here, on the contrary, “running on the spot”: we move deep into the tree, add a letter to done , but do not cross out anything from todo .

 R.done = R.done + dest; R.todo = R.todo; R.road = R.road[ dest ]; R.cost = R.cost + 1;

The completed letter in done arises out of plan, from nowhere, which is tantamount to inserting it.

Fathers and Sons

Since at each step there is some ~~parallelization of the worlds of~~ an increase in the number of objects, we will not modify the parameters of the traveler, but create new ones in his image and likeness. The following function unites everything: it receives as input to the traveler, filling out two vectors of travelers at the exit in accordance with the given route, the keys of the tree node and the rules described above.
Those who will continue to move into the team vector, and finished to the finished , that is, those who have todo empty, and the current tree node contains the end-of-word ~~bucks~~ .

 void step_forward( const rambler_t & R, team_t & team, team_t & finished ) { char next = R.todo[0]; //  const string & todo = next ? R.todo.substr(1) : ""; for( map<char, trie_t>::const_iterator it = R.road->begin(); it != R.road->end(); ++it ) { const trie_t * road = &it->second; //  char dest = it->first; //  if ( next == dest ) team.push_back( rambler_t( todo, //    R.done + dest, R.cost, road )); else { team.push_back( rambler_t( todo, //  R.done + dest, R.cost + 1, road )); team.push_back( rambler_t( R.todo, //  R.done + dest, R.cost + 1, road )); if ( !next && dest == '$' ) finished.push_back( R ); // ! } } if ( next ) team.push_back( rambler_t( todo, //  R.done, R.cost + 1, R.road )); }

Natural selection

And the second nuance: by generating an avalanche of new and new participants at every step, we act irrationally. Crowds idly wandering about a tree are poorly managed, most of them are useless and only devour resources. We introduce restrictions and arrange a competition for them - let them run.

First, we will assign the maximum allowable amount of fines, and we will throw out all those who exceeded it, deviating too much from the given path.

Secondly, we will drop all those who are lagging behind. True, there is a risk that among the exiled outsiders, someone could sprint to catch up with the leaders at the very end of the race, but for the sake of optimization you will have to put up with it.

In order to identify those lagging behind, let us estimate each person’s chances for success, that is, that the rest of the way will be passed without disqualification for exceeding the maximum amount. Chance will be estimated as follows:

 double ramb_chances( const rambler_t & a ) { return log( ( a.road->w + 1 ) * ( a.done.size() + 1 ) ) / ( 1 << a.cost ); }

Why is there a logarithm? What are the bit shifts? Why is the function exactly like that? You can try to explain. Obviously, the greater the amount of fines already received, the less likely to reach the end. And the more words in a subtree, the higher the probability of finding at least one suitable among them. And the length of the path already traveled, as it indicates “endurance”.

It must be admitted that the latter is very far-fetched by the ears. But in any heuristics the main thing is to work. This one works. And, anyway, we are nearing completion. Denote the maximum amount of penalties as max_cost , the maximum size of a team as team_size .

Put the first mover at the beginning of all the roads:

 team_t walkers, leads; walkers.push_back( rambler_t( word, "", 0, &glossary ) );

We make a step by generating the next “generation”:

 team_t next_g; for( size_t i = 0; i < walkers.size(); ++i ) step_forward( walkers[i], next_g, leads );

In leads who complete their path are written, the next_g are not reached yet. We sort still not descending their chances to reach:

 sort( next_g.begin(), next_g.end(), ramb_chance_cmp );

And we repeat all this until there is still someone to move on. Ignore those whose fines exceeded max_cost , and those who are not included in the team_size best:

 while ( !walkers.empty() ) { team_t next_g; for( size_t i = 0; i < min( walkers.size(), team_size ); ++i ) if ( walkers[i].cost < max_cost ) step_forward( walkers[i], next_g, leads ); walkers.swap( next_g ); sort( walkers.begin(), walkers.end(), ramb_chance_cmp ); }

By the way, leads may have several structures with the same done lines. Guess where they are from?
Leave only the unique elements:

 sort( leads.begin(), leads.end(), ramb_done_cmp ); //   done leads.resize( distance( leads.begin(), unique( leads.begin(), leads.end(), ramb_uniq ) ) ); //   sort( leads.begin(), leads.end(), ramb_chance_cmp ); //

That's almost all.

Add all of the above into the spellcheck function, pass our word into it, the maximum penalty and the number of candidates. You can use:

 while( getline( cin, word ) ) { team_t fixed = spellcheck( word, word.size()/2, 512 ); cout << word << endl; for( size_t i = 0; i < fixed.size(); ++i ) cout << '\t' << fixed[i].done << ' ' << fixed[i].cost << endl; cout << endl; }

The limit of 512 candidates and the maximum deviation in half the length of a given word I picked up manually. With such parameters, the algorithm's performance is approximately 10 words per second; it copes with the words , , and the , but does not turn into .

Now absolutely everything. Let's get started, run. And here is an example of the work of our simple but effective mechanism for correcting typos:

   2    3   3   3   2  2  2

It is evident that not all of his results are perfect, but this is fixable.

Results of the way

I'm sure many people have learned in the description of one of the varieties of an informed search - the A * algorithm , or rather, its variation - the A * algorithm with iterative deepening and restriction on memory. Of course, the described algorithm is only the first, rough approximation to what is actually used in battle.

What else can a real combat spell checker know?

In our reference dictionary there are only the original forms of words, so he will correct the on the , which is wrong. It is necessary to have an idea about morphology, or to include all forms of words in the dictionary. The first is better, but harder.

All words are equal. This leads to inaccuracies in choosing the best hypothesis:
```
   1  1 
```
The problem will be solved if you add information about the frequency of use of words.

All errors are equal. Hence the uncertain results:
```
   1  1 
```
Corrected by grading penalties in accordance with the probability of each possible error. Replacing more likely than per , so a more likely.

Punto Switching. It's simple. Add a table of correspondence of the form q- , w- , a- and one more type of typo - change of the layout. Is done.

Gluing and sticking words. This is somewhat more complicated. It is necessary to teach the “travelers” to jump from the leafy nodes of the tree to its root and separately handle the gaps in their “routes”.

Context fixes. The same words in different situations should be corrected in different ways. For example, and , and , a and a . To correct such typos, statistical data and / or grammar rules are needed. In other words, the language model. This is another story.

Side effects

It would be wrong not to mention the possibilities of the described approach that are beyond the scope of our task.

If you cancel the penalties for adding characters to the end of the specified "route", then we get the solution to the problem of auto-completion .

If we insert into the program a table of phonetic correspondence of symbols, we obtain a system for the recognition of transliteration of any degree of ambiguity. The algorithm will find the most likely word.
```
  → j, zh, g, dj j → , , ,  ... 
```

Tree transitions according to the rules given by the correspondence of numbers and letters will turn the spellchecker into the T9 set system:
```
  2 →  ... 9 →  
```

The cost of the error, calculated not on the probabilities of misprints, but solely on the distances between the keys (or rather, their images on the touch screen), forms the basis of the system of “sliding” word entry - Swype .

The task of fuzzy search is part of speech recognition systems, handwritten texts ... However, enough. This has already been written very much indeed.

The full text of the program can be found here . The program is slightly longer than Peter Norvig's famous nanosellchecker, but it is far superior in its capabilities, not inferior in performance.

Thanks for attention!

Useless links

Inspiration Source: Online Spelling Correction for Query Completion
The world's shortest spell checker by Peter Norvig
1200 ways to write the word "classmates" - can be used to calculate the error model
Alternative approaches: one , two , three

Mikhail Dolinin,
search request manager

Source: https://habr.com/ru/post/178401/

All Articles