Introduction to text search

Probably any programmer at least once in his life faced the task of finding a string in a string. Once I had to face it. Since then, this business is very fond of me. Not to say that I have achieved a lot in this, but I am not going to stop.
That's why I decided to write, but, in order to start more or less smoothly, I would like to make an introduction in the form of several introductory articles on the basics of text search.

When an “ordinary” person needs to find some subsequence of characters in an “ordinary” text, he almost always looks for approximately, that is, not necessarily an exact match, because the text may have an error, and in real text the form may be somewhat different. from the search pattern.
Thus, two types of search can be distinguished: Exact String Matching Algorithm and Fuzzy String Matching Algorithm.

Exact search

Exact search is simpler to implement, but the simplest one is brute force, called the brute force algorithm in accordance with the translation.
')

int Match( string input, string pattern) { int inputLength = input.Length; int patternLength = pattern.Length; for ( int i = 0; i <= inputLength - patternLength; i++) { bool success = true ; for ( int j = 0; j < patternLength; j++) { if (input[i + j] != pattern[j]) { success = false ; break ; } } if (success) return i; } return -1; }

The algorithm attempts to match the search pattern with a part of the input string of the same length starting from the zero character. If this attempt fails, the match is made from the first character, and so on.
brute force

Fuzzy search

What we have with a fuzzy search? In principle, a fuzzy search can be organized by slightly modifying an already written algorithm.
But first, a little theory.
Let us be given some integer k, and some function of the distance (distance) of the substring under consideration from the sample. We need to find a substring s such that distance (s) <k.

Hamming distance

The simplest option is to look for the remoteness of two lines of the same length - to determine the number of mismatched characters in the corresponding positions, or the number of substitutions to get an identical line, which is essentially the same. This option was proposed by Hemming, and therefore the remoteness function is called the Hamming distance.

Levenshtein distance

If you want to compare rows of different lengths, then you need to use the delete and insert operations.
There are also two options here.

The remoteness function or distance d is defined as the number of replacements, deletions or insertions necessary to obtain the desired pattern from the considered line.
d is defined as the sum of the prices of changes in a substring to get the desired pattern, where
- removal cost = 1
- insert price = 1
- replacement price = 2 (first delete the character for 1 and also for the price of 1, insert the desired one)

Fuzzy search can be used to search for possible fixes when checking spelling, the main thing is not to set the number k too small, or glitches are possible like in Firefox:

Let's try to implement a fuzzy search on the brute force algorithm. We change the inner loop so that for each substring it finds the Hamming distance. You will also need another parameter - k. And for this case, in my opinion, it will be more convenient to use real k in the range from zero to one. This number will show the maximum percentage of mismatches. Then the formula will be somewhat different:
distance (s) <= length (s) * k.
So, the implementation:

int Match( string input, string pattern, double k) { if (k < 0 || k > 1) throw new ArgumentException( "Invalid value for coefficient" , "k" ); int inputLength = input.Length; int patternLength = pattern.Length; for ( int i = 0; i <= inputLength - patternLength; i++) { int distance = 0; for ( int j = 0; j < patternLength; j++) { if (input[i + j] != pattern[j]) { distance++; } } if (distance <= k * patternLength) return i; } return -1; }

And a little refactor code for decency:

public int Match( string input, string pattern, double k) { if (k < 0 || k > 1) throw new ArgumentException( "Invalid value for coefficient" , "k" ); int inputLength = input.Length; int patternLength = pattern.Length; for ( int i = 0; i <= inputLength - patternLength; i++) { if (GetDistance(input.Substring(i, patternLength), pattern) <= k * patternLength) return i; } return -1; } private int GetDistance( string subString, string pattern) { int changesCount = 0; int patternLength = pattern.Length; for ( int i = 0; i < patternLength; i++) { if (subString[i] != pattern[i]) changesCount++; } return changesCount; }

Summing up, I will say that the presented algorithms are very inefficient, however, the topic does not pretend to be anything but an introduction to the text search.

Here you can download the archive of the project with tests.

ps Next, I plan to continue my undertakings in the form of a series “parsing for dummies”, which I am for now :)

Source: https://habr.com/ru/post/48249/

All Articles

Introduction to text search

Exact search

Fuzzy search

Hamming distance

Levenshtein distance

More articles: