The topic named in the title of this article is not new. On the Internet you can find plenty of questions about how to implement it, but noticeably fewer answers. And those answers often boil down to advice to use a third-party product, for example Sphinx. Yet such heavyweight add-ons are frequently unnecessary.
Not so long ago, the following task arose:
- The information system (IS) holds a database of actors, musicians and DJs, as well as the occasional politician, scientist or other public figure.
- The IS was filled in by amateurs and enthusiasts, many of whom found it difficult to render a given surname into Russian correctly. And sometimes the Russian version of the name sounded nothing like the original; it simply reproduced the foreign spelling letter for letter.
- Names were entered into the database in fairly free form. The word order often varied (especially for Asian names), and nicknames or pseudonyms might or might not be added.
- There were also plain typos: replaced, omitted or added letters. Sometimes, for some reason, some of the letters were typed in Latin.
- Fortunately, there were no extraneous characters.
- The names were stored in MyISAM tables in MySQL in UTF-8 encoding.
- The total number of names in the database was approximately 138.5 thousand.
That is the "given". The "required" was as follows:
By automatically scanning the database, identify and form pairs of names that are similar by some criterion, to be passed on for manual analysis to decide whether they belong to the same person or to different ones. Moreover, the search was supposed to be somewhat paranoid. That is, unlike a "Perhaps you meant..." prompt, it had to return a few extra names, on the principle that it is better to be overcautious than undercautious. Consequently, a somewhat larger number of pairs had to be analyzed for similarity. And this had to be done not in real time, but still in a genuinely finite time: not in a day or a week, but, in the worst case, in a few hours.
In short: almost one and a half hundred thousand names, nicknames and pseudonyms, without division into separate fields (last name, first name, middle name, etc.), with arbitrary word order, typos, errors and variant spellings. Moreover, the names are of the most varied national origin: Russian, English, Jewish, Polish, French, Korean, Chinese, Japanese, Indian and so on. Among them, all pairs that strongly resemble one and the same name must be found.

At first the task seemed extremely difficult. To me, a person unfamiliar with fuzzy search, it was hard even to imagine formal criteria for judging the similarity or difference of names.
But gradually an understanding of the applicable approaches and algorithms began to take shape. First of all, these were algorithms either specifically targeted at, or simply well proven in, searching for misspelled names:
- Phonetic coding (in particular, Russian Metaphone by Peter Kankovsky and its modifications [1]).
- Similarity measures based on the Levenshtein edit distance [2].
- Extraction of common substrings [3].
- The Jaro-Winkler algorithm [4].
- N-gram analysis [5].
Common sense suggested the following:
Phonetic coding is not suitable because of how the database was filled. A phonetic search can find names written down by ear with errors, but not names mangled in translation by amateurs. Let me explain with an example:

Suppose captain third rank Jeannette Devereaux [6] walks into our native passport office and introduces herself clearly and loudly, by the book. Writing her name down by ear, the clerk may record it as "Diver", "Deviryu" or even "Divyra", and skip the doubled consonants as well. All such "auditory hallucinations" are tracked perfectly well by phonetic fuzzy search algorithms.
But a pseudo-translator who knows nothing of French (though familiar with English), copying data from her military ID, can easily write "Jennette Deverox". The especially gifted may also record her callsign, or put the surname before the first name.
And many translators even from English do not know that in "Ralph Fiennes" the first name is read not "Ralph" but "Rafe".
So phonetic coding does not fit here: in this case erroneous names are written not "as they are heard" but "as they are seen".
An attempt to apply phonetic algorithms to names perceived by eye rather than by ear can make the phonetic indexes of the names differ from each other even more than their spellings do.
So maybe use edit distance?
It would be a good option, were it not for the fact that these metrics are designed mostly for comparing single words, not phrases with arbitrary word order. Swap the first and last names, or insert a pseudonym between them, and they show an extremely unsatisfactory result.
In that case the names have to be split into separate words, each compared individually. This significantly complicates (and slows down) the algorithm, even if the individual words are first reduced to their phonetic index. An added complication is that the number of words in a name is not constant. Taking possible omissions and word permutations into account, the laboriousness of the process is hard even to imagine.
Other methods
The drawback of metrics based on extracting common substrings is the high time complexity of computing relevance, with unclear indexing possibilities, especially for short names. Even a single extraction of the longest common substring has time complexity O(nm). And since the search for common substrings must then continue recursively in the remainders, computing the relevance function is a very expensive operation, losing even to N-gram comparison.
Again, when words are swapped, relevance drops sharply.
The Jaro-Winkler distance is computed significantly faster. But, unfortunately, it is also not resistant to word swaps.
So, to avoid having to split and recombine the words of a name, we willy-nilly settled on N-gram analysis. Or rather, on trigrams, since short Korean names leave no room for extracting longer N-grams.
Stage One. A head-on attack.
To begin with, in order to get an idea of the time complexity of trigram comparison, I wrote the simplest version of a script that compares the names in the database pairwise.
The script was written and tested at home on an eight-year-old computer with a single-core Intel Pentium D 925 on an ASUS P5B-E motherboard and ordinary Seagate hard drives of about the same age.
The script worked, as they say, head-on. It read the database and, for each name, after normalization, split it into unique trigrams and counted how many of them occurred in each of the names processed before it (that is, in names of persons already checked). If twice the number of matches divided by the sum of the unique trigram counts of the two names exceeded a certain threshold (for example, 0.75 or 0.8), the pair was written to a file.
The normalization of the name was as follows:
- Latin letters and some digits were replaced with similar-looking Russian letters.
- All lowercase letters were replaced with uppercase.
- The letter "Ё" was replaced with "Е", and "Ъ" with "Ь".
- All other letters, punctuation marks, non-printable characters and their sequences were replaced with two spaces.
- The name was framed with double spaces.
This normalization produced a simplified name without punctuation or transliteration artifacts, in which every word is framed by, or separated from other words by, double spaces.
The use of double spaces between words, as well as at the beginning and end of the name, made the trigram analysis algorithm completely insensitive to word shifts.
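As an illustration, here is a minimal sketch of such normalization. It is not the author's original listing, and the Latin-to-Cyrillic replacement table is abridged and hypothetical:

```php
<?php
// A sketch of the normalization described above (not the original listing).
function normalize_name(string $name): string
{
    // 1. Replace visually similar Latin letters and digits with Russian
    //    ones (abridged, hypothetical table).
    $name = strtr($name, [
        'A' => 'А', 'B' => 'В', 'C' => 'С', 'E' => 'Е', 'H' => 'Н',
        'K' => 'К', 'M' => 'М', 'O' => 'О', 'P' => 'Р', 'T' => 'Т',
        'X' => 'Х', 'a' => 'а', 'c' => 'с', 'e' => 'е', 'o' => 'о',
        '0' => 'О', '3' => 'З',
    ]);
    // 2. Upper-case everything (UTF-8 aware).
    $name = mb_strtoupper($name, 'UTF-8');
    // 3. Collapse the alphabet: "Ё" -> "Е", "Ъ" -> "Ь".
    $name = strtr($name, ['Ё' => 'Е', 'Ъ' => 'Ь']);
    // 4. Any run of characters other than Russian capital letters
    //    becomes a double space.
    $name = trim(preg_replace('/[^А-Я]+/u', '  ', $name));
    // 5. Frame the whole name with double spaces.
    return '  ' . $name . '  ';
}
```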
From the normalized name, the unique trigrams were extracted: combinations of three consecutive characters (including spaces).
Then, for two names, the following relevance coefficient was calculated: twice the number of common trigrams divided by the sum of the unique trigram counts of the two names.
The uniqueness requirement for trigrams is stated for a reason. On the one hand, counting all trigram occurrences in names, not just unique ones, would allow us to confidently distinguish names such as "John Smith" and "John John Smith"; counting only unique trigrams makes these names appear identical.
On the other hand, first, the need to distinguish such names arises rarely. Second, computing relevance over all trigrams, not only unique ones, would significantly complicate the procedure: instead of doubling the number of common trigrams, one would have to sum the occurrences of the first name's trigrams in the second and of the second name's trigrams in the first, roughly doubling the time spent on each relevance calculation. And third, it would rule out an optimization of the relevance calculation that will be discussed below.
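In code, the trigram extraction and the relevance coefficient might look roughly like this (a sketch, not the author's listing; it reuses normalize_name from the sketch above):

```php
<?php
// relevance = 2 * |common trigrams| / (|uniq1| + |uniq2|)
function unique_trigrams(string $normalized): array
{
    $set = [];
    $len = mb_strlen($normalized, 'UTF-8');
    for ($i = 0; $i <= $len - 3; $i++) {
        // Every three consecutive characters, spaces included.
        $set[mb_substr($normalized, $i, 3, 'UTF-8')] = true;
    }
    return $set; // trigram => true
}

function relevance(array $t1, array $t2): float
{
    $common = count(array_intersect_key($t1, $t2));
    return 2 * $common / (count($t1) + count($t2));
}

// Example: one extra letter and swapped word order.
$a = unique_trigrams(normalize_name('Джон Смит'));
$b = unique_trigrams(normalize_name('Смит Джонн'));
echo relevance($a, $b); // 0.88: still high despite both distortions
```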
So, the head-on algorithm was written and tested on the first few thousand names. The result was depressing. As was to be expected, the time to find matches across the entire array of names depended quadratically on its size. The extrapolated time to process the whole array of 130-140 thousand names came to about two years. A bit too long!
But now there was an initial estimate of the algorithm's performance.
Stage Two. Search for ways to optimize.
First of all, I had to determine which elements of the algorithm needed optimizing and which one was the notorious "bottleneck" slowing down the entire system. Several areas of optimization emerged:
- The initial database selection should return not the entire array of names, but only those at least slightly similar to the name being checked, to reduce the number of subsequent operations. Timing different sections of the code showed that the database queries were the bottleneck; both the execution time of each query and the number of candidate names returned by each selection had to be reduced.
- Substring comparison and copying on multibyte UTF-8 strings are several times slower than the same operations on single-byte strings.
- An exact calculation of the relevance coefficient is unnecessary when the share of common trigrams is small: the process can be interrupted as soon as it becomes clear that even if all remaining trigrams were common, the required threshold would not be exceeded.
- Other small things.
Stage Three. Implementation.
First of all, as the simplest step, an attempt was made to switch the normalized names from UTF-8 to Windows-1251. However, it turned out that MySQL is very touchy about storing strings in different encodings within one database, even in different tables; this leads to hard-to-localize errors during database operations. Therefore the normalized name is still stored in an additional attached table in UTF-8 and, when needed, recoded to Windows-1251.
Storing the normalized name saves significant time, since each record has to be normalized only once.
In addition, the number of unique trigrams of the normalized name is stored in the same joined table. This value is used to cut the relevance calculation short once too large a difference between the names becomes apparent, so it no longer has to be recomputed every time.
And, most importantly, the processed names are indexed with a trigram index: the additional table lists all unique trigrams present in the normalized name. The trigrams themselves, to avoid storing them in the "slow" UTF-8 encoding, were stored as an integer hash: the three low bytes of the number correspond to the numeric codes of the characters in Windows-1251.
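Such a hash can be built, for example, like this (a sketch; the exact byte order used in the original code is not preserved, so the one below is an assumption):

```php
<?php
// Pack a trigram into an integer: each character is converted to
// Windows-1251, and its single-byte code occupies one of the three
// low bytes of the result.
function trigram_hash(string $trigram_utf8): int
{
    $cp1251 = iconv('UTF-8', 'Windows-1251', $trigram_utf8);
    return (ord($cp1251[0]) << 16) | (ord($cp1251[1]) << 8) | ord($cp1251[2]);
}
```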
One part of the approach, however, turned out to be completely ineffective, as was discovered later: for each trigram of the checked name a separate selection of candidate names was made, and the selections were then merged into a single array. Each name in this array carried a counter of how many selections it appeared in. The assumption was that this would speed up the relevance calculation, since the candidate names would not need to be decomposed into trigrams again: the counter alone would be enough.
Unfortunately, the multiple selections from the MySQL database (to a greater extent) and the building of the merged array of names (to a lesser extent) created overhead that exceeded the gain from the simplified relevance calculation many times over.
As a result of the described optimizations, a significant reduction in time costs was achieved:
Using a trigram index for the initial selection cut the projected time from two years down to less than three months. Using numeric values instead of UTF-8 substrings shaved off another 10-12% of the overall processing time. Still, 2.5 months of uninterrupted work seemed too long. The bottleneck remained the initial selection of similar names.
Stage Four. Further improvements.
Querying the database again for each trigram index was not the best idea. Not only did the numerous selections take a long time, the merged array was also very large, and processing it took considerable time despite the per-element counter of common trigrams.
I had to abandon the idea of this counter and instead fetch the entire initial set of candidate names with a single query using the SQL WHERE ... IN construct. This reduced the forecast several more times, down to one month.
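Roughly like this (a sketch; `prs_3gramms` is a hypothetical trigram analogue of the `prs_5gramms` table shown at the end of the article, and mysqli stands in for the borrowed query helper actually used):

```php
<?php
// One query for all candidate ids instead of one query per trigram.
// $trigrams is the unique-trigram set of the checked name.
$hashes = array_map('trigram_hash', array_keys($trigrams));
$sql = 'SELECT DISTINCT `id` FROM `prs_3gramms`'
     . ' WHERE `code` IN (' . implode(',', $hashes) . ')';
$candidates = [];
$result = $mysqli->query($sql);
while ($row = $result->fetch_row()) {
    $candidates[] = (int)$row[0];
}
```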
Next, the relevance calculation procedure was changed. From the lower relevance bound and the numbers of unique trigrams in the two names, the maximum allowed number of "misses" was computed: the number of the candidate name's trigrams that do not occur in the checked name. All trigrams of the candidate name were then tested for occurrence in the checked name, and as soon as the number of "misses" exceeded the maximum allowed, the comparison stopped. This cut the forecast by a few more percent, but it was still very far from production values.
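A sketch of this cutoff. The derivation: relevance 2c/(n1+n2) >= r requires at least ceil(r*(n1+n2)/2) common trigrams, so at most n2 minus that many of the candidate's trigrams may miss:

```php
<?php
// Returns true exactly when relevance(checked, candidate) >= $min_rel,
// but stops early as soon as the miss limit is exceeded.
function similar_enough(array $checked, array $candidate, float $min_rel): bool
{
    $n1 = count($checked);
    $n2 = count($candidate);
    $allowed = $n2 - (int)ceil($min_rel * ($n1 + $n2) / 2);
    if ($allowed < 0) {
        return false; // the threshold is unreachable, no point comparing
    }
    $misses = 0;
    foreach ($candidate as $trigram => $unused) {
        if (!isset($checked[$trigram]) && ++$misses > $allowed) {
            return false; // stop as soon as the limit is exceeded
        }
    }
    return true;
}
```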
Stage Five. The solution.
The task seemed intractable. On the one hand, the number of names in the initial selection had to be reduced significantly; on the other hand, the reduction must not be too artificial, so as not to accidentally cut off names close to the lower boundary of the acceptable relevance range.
The approach was found by simply looking through the pairs of similar names already found. Practically all such pairs shared common substrings 6-7 characters long (counting the framing double spaces). Simple reasoning (not reproduced here for its lack of rigor) leads to the conclusion that to reach a relevance above 0.75, normalized names must share a common substring 5 characters long or longer.
Thus, for the preliminary search we move from indexing with trigrams to indexing with pentagrams (5-character substrings). To fit a pentagram hash into an integer variable, which is 32 bits wide in the 32-bit version of PHP, we represent each character with 5 bits; conveniently, the collapsed alphabet has only 31 letters (without "Ё" and "Ъ") plus the space, 32 symbols in all.
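A possible packing, as a sketch (the particular code assignment, space = 0 followed by the 31 letters in alphabetical order, is my assumption):

```php
<?php
// 5 characters * 5 bits = 25 bits, well within a 32-bit integer.
function pentagram_hash(string $pentagram_utf8): int
{
    static $codes = null;
    if ($codes === null) {
        // Space plus the 31 letters of the collapsed alphabet (no Ё, no Ъ).
        $chars = preg_split('//u', ' АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЫЬЭЮЯ',
                            -1, PREG_SPLIT_NO_EMPTY);
        $codes = array_flip($chars); // character => 0..31
    }
    $hash = 0;
    foreach (preg_split('//u', $pentagram_utf8, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $hash = ($hash << 5) | $codes[$ch];
    }
    return $hash;
}
```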
Another small addition.
As I wrote above, when calculating relevance, the unique trigrams of one name were tested for matches among the trigrams of the other, and the calculation was stopped once the number of "misses" exceeded the maximum allowed.
If the number of unique trigrams in the first name is much smaller than in the second, the computed maximum allowed number of misses can turn out negative. This means the names cannot possibly be similar at the given relevance threshold, and there is no need to check them at all.
But this cutoff does not work in the opposite case, when the first name has many more unique trigrams than the second. Therefore, boundary values for the number of unique trigrams of the normalized name were additionally built into the selection query.
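These bounds follow from the fact that the number of common trigrams cannot exceed the smaller of the two counts: with relevance 2c/(n1+n2) and c <= min(n1, n2), a candidate with n2 unique trigrams can reach threshold r only if r/(2-r)*n1 <= n2 <= (2-r)/r*n1 (for r = 0.75 that is 0.6*n1 to about 1.67*n1). A sketch, with `$n1` and `$hashes` taken from the surrounding (not shown) code:

```php
<?php
// The `num` column of `prs_normal` stores the unique trigram count.
$r = MIN_RELEVANT;
$min_num = (int)ceil($r / (2 - $r) * $n1);
$max_num = (int)floor((2 - $r) / $r * $n1);
$sql = 'SELECT DISTINCT g.`id` FROM `prs_5gramms` g'
     . ' JOIN `prs_normal` n ON n.`id` = g.`id`'
     . ' WHERE g.`code` IN (' . implode(',', $hashes) . ')'
     . " AND n.`num` BETWEEN $min_num AND $max_num";
```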
This improvement is somewhat controversial. With the relevance threshold set to 0.75 it yields no speedup; on the contrary, it loses about 3% of performance because of the more complex selection. But with the threshold set to 0.8, we already get a 3% gain.
Nevertheless, the technique was kept. The losses are not that big, and one can first run a preliminary search for similar pairs at a high relevance threshold, and only after that preliminary cleanup set the boundary value to 0.75.
The bottom line.
The switch from the initial selection by common trigrams to selection by pentagrams sped the script up dramatically. Searching for pairs with relevance 0.8 among the 138.5 thousand names took about 5 hours instead of a month. That is, assuming the quadratic dependence of processing time, a single search for similar names across the entire database for one given name takes about 0.26 s (2 × 18,000 s / 138,500 names). Not negligible, but bear in mind that this was a test run not on a server with a powerful processor and a high-performance disk subsystem, but on a half-dead eight-year-old home computer.
In principle, the initial search could even be done with hexagrams (6-character index substrings). That would speed things up several times more. But there was a well-founded fear that many pairs of short Asian names (and not only those) would be cut off. And there is no particular point anyway, since the subsequent manual analysis of the resulting pairs (and there are more than 11,000 of them) will in any case take several months.
Hexagram indexing is, however, quite suitable for prompts like "Perhaps you meant...", where excessive paranoia is not required and one should, on the contrary, find the minimum number of best-matching names.
In principle, for long names one could also use indexes of seven-character substrings and, to fit them into 32-bit variables, apply preliminary phonetic coding to shrink the alphabet to 15 characters plus the space. But no such task existed at the time.
Unfortunately, shortly after the script was written, the site was blocked for copyright infringement. Although it soon moved to another address, amid the problems that piled up, the search for duplicate names never came to pass.
Algorithm fragments:
DB table structures:

```sql
CREATE TABLE IF NOT EXISTS `prs_persons` (
  `id` int(10) unsigned NOT NULL,
  `name_ru` varchar(255) collate utf8_unicode_ci default NULL,
  -- ... other fields ...
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE IF NOT EXISTS `prs_normal` (
  `id` int(10) unsigned NOT NULL,
  `name` varchar(255) collate utf8_unicode_ci default NULL, -- normalized name
  `num` int(2) unsigned NOT NULL,                           -- number of unique trigrams
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE IF NOT EXISTS `prs_5gramms` (
  `code` int(10) unsigned NOT NULL, -- pentagram hash
  `id` int(10) unsigned NOT NULL,   -- person id
  KEY `code` (`code`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
```
Constants and auxiliary variables, preliminary actions:

```php
$opt_debug_show_sql = 0;
$opt_debug_mode = 0;
$opt_table_prefix = "prs";
define('MIN_RELEVANT', 0.8);
define('SITE_PREFIX', "…");
```
The body of the main comparison loop:
The function for formatted queries to MySQL:
The result of the test run on the first thousand records:
Work result:

ADDRESS-1 | NAME-1 | ADDRESS-2 | NAME-2 | RELEVANCE |
---|---|---|---|---|
/people/784 | PETER JASON | /people/389 | PETER JACKSON | 0.8125 |
/people/1216 | CHARLES DENS | /people/664 | CHARLES DENNIS | 0.8 |
/people/1662 | STEWART F WILSON | /people/1251 | STEWART WILSON | 0.914285714286 |
/people/1798 | Michael mann | /people/583 | MICHAEL LEHMANN | 0.846153846154 |
/people/2062 | MICHAEL BIRN | /people/265 | MICHAEL BIN | 0.8 |
/people/2557 | Gina Davis | /people/963 | GIN DAVIS | 0.8 |
/people/3093 | DJ JONSON | /people/911 | Don Johnson | 0.818181818182 |
/people/3262 | TOM EVERETT | /people/586 | TOM EVERETT SCOTT | 0.848484848485 |
/people/3329 | ROBERT RITTI | /people/3099 | ROBERT BITTI | 0.827586206897 |
/people/3585 | TRACY KEY WOLF | /people/2810 | TRAISY WOLF | 0.857142857143 |
/people/3598 | ALEXANDER KALUGIN | /people/2852 | ALEXANDER KALYAGIN | 0.85 |
/people/3966 | SERGEY BODROV | /people/2991 | SERGEY BODROV ML | 0.888888888889 |
/people/3994 | SERGEI BOBROV | /people/3966 | SERGEY BODROV | 0.8125 |
/people/4049 | Richard Lewis | /people/2063 | Richard J. Lewis | 0.882352941176 |
/people/4293 | JERRY LIVLEY | /people/2006 | Jerry lee | 0.88 |
/people/4377 | JOAN KYUSAK | /people/3774 | JOHN KYUSAK | 0.827586206897 |
/people/4396 | DIN MCDERMOTT | /people/2614 | DILAN MCDERMOTT | 0.833333333333 |
/people/4608 | Shaun Johnston | /people/2036 | DJ J. Johnston | 0.8 |
/people/4981 | CHRISTOPHER MEY | /people/3233 | CHRISTOPHER MURRAY | 0.8 |
/people/5019 | JANE ALEXANDER | /people/381 | JASON ALEXANDER | 0.842105263158 |
/people/5551 | CARLOS ANDRES HOMEZ | /people/1311 | CARLOS GOMEZ | 0.810810810811 |
/people/5781 | ALEX NEUBERGER | /people/4288 | ALEX NUBERGER | 0.8 |
/people/5839 | JOY TRAVOLTA | /people/935 | JOHN TRAVOLTA | 0.8125 |
/people/5917 | JOE JOHNSTON | /people/2036 | DJ J. Johnston | 0.833333333333 |
/people/5917 | JOE JOHNSTON | /people/4608 | Shaun Johnston | 0.8 |
/people/6112 | THOMAS RYAN | /people/4869 | THOMAS JAY RYAN | 0.823529411765 |
/people/6416 | BRIAN GEORGE | /people/3942 | GEORGE BRIANTH | 0.848484848485 |
/people/6520 | JOHN CARNEY | /people/5207 | JOHN KANI | 0.8 |
/people/6834 | JOHN JOHN ANDERSON | /people/5049 | JOE ANDERSON | 0.838709677419 |
/people/6836 | MICHAEL ESTON | /people/5056 | MICHAEL WESTON | 0.827586206897 |
/people/6837 | DAVID BARONE | /people/5884 | DAVID BARON | 0.827586206897 |
/people/7261 | BILLY GRAY | /people/1695 | BILLY RAY | 0.8 |
/people/7361 | ALAN DAVID | /people/3087 | DAVID ALAN BASH | 0.838709677419 |
/people/7447 | DAVID AEYER | /people/2277 | TAYER DAVID | 0.814814814815 |
/people/7497 | ALEXANDER KARAMNOV | /people/3857 | ALEXANDER KARPOV | 0.8 |
/people/7499 | NIKOLAS LAMLI | /people/4424 | NIKOLAS LEE | 0.827586206897 |
/people/7534 | RICHARD RICHL | /people/3547 | Richard Nichl | 0.857142857143 |
/people/7547 | SVETLANA STARIKOVA | /people/1985 | SVETLANA SVETIKOVA | 0.8 |
/people/7677 | JAIS ALEXANDER | /people/381 | JASON ALEXANDER | 0.842105263158 |
/people/7677 | JAIS ALEXANDER | /people/5019 | JANE ALEXANDER | 0.833333333333 |
/people/8000 | GREGORY SMITH | /people/7628 | GREGORY P SMITH | 0.909090909091 |
/people/8137 | KASPER KRISTENSEN | /people/128 | JESPER KRISTENSEN | 0.8 |
/people/8186 | SHANE COG | /people/6235 | KAIN KOSUGI | 0.814814814815 |
/people/8219 | BRANDON JAMES OLSON | /people/797 | JAMES OLSON | 0.810810810811 |
/people/8442 | Gunnar Jolfsson | /people/7033 | Gunnar Johnson | 0.8 |
/people/8458 | JOHN ALEXANDER | /people/381 | JASON ALEXANDER | 0.810810810811 |
/people/8458 | JOHN ALEXANDER | /people/5019 | JANE ALEXANDER | 0.8 |
/people/8614 | DAVID HERMAN | /people/4945 | DAVID HEYMAN | 0.8 |
/people/8874 | NIKOLAS ROUG | /people/1667 | NIKOLAS ROU | 0.827586206897 |
/people/8987 | DAVID ROSS | /people/4870 | DAVID CROSS | 0.814814814815 |
/people/9132 | ROBERT LONG | /people/7683 | ROBERT LONGO | 0.827586206897 |
/people/9202 | ROBERT MENDEL | /people/3410 | ROBERT MANDEL | 0.8125 |
/people/9229 | ASHLEY LAWRENCE | /people/2534 | LAWRENCE ASHLEY | 1 |
/people/9303 | JOHN EYLARD | /people/8703 | JOHN EYLUARD | 0.827586206897 |
/people/9308 | Shaun Roberts | /people/6552 | SEAN ABOUT ROBERTS | 0.903225806452 |
/people/9347 | STEPHEN SERGE | /people/2911 | Stefen Surzhik | 0.8 |
/people/9432 | POLLY SHANNON | /people/2240 | MOLLY SHANNON | 0.8 |
/people/9583 | JULIE HARRIS | /people/904 | JULIUS HARRIS | 0.838709677419 |
/people/9788 | ANTHONY STARR | /people/8308 | ANTHONY STARK | 0.8 |
/people/9835 | MICHAEL WATKINS | /people/4727 | MICHAEL IN WATKINS | 0.864864864865 |
/people/9893 | STEVE MARTINO | /people/6457 | STEVE MARTIN | 0.827586206897 |
Links and notes:
1. Phonetic coding:
→ Phonetic algorithms (Habrahabr)
→ "What is your last name", or Russian MetaPhone (a description of Russian Metaphone)
2. Levenshtein distance (Wikipedia).
3. Word similarity calculation based on extracting common substrings is implemented in PHP in the similar_text function. The PHP documentation states that the function is based on the Oliver algorithm [1993]. However, the original algorithm, entitled "Ratcliff/Obershelp Pattern Recognition Algorithm", was published by John W. Ratcliff on the Dr. Dobb's site (Pattern Matching: the Gestalt Approach) in 1988; Ian Oliver merely used it in his book Programming Classics: Implementing the World's Best Algorithms.
4. Sources on the Jaro-Winkler algorithm can be found, for example, here: Jaro-Winkler Distance.
5. Trigram index, or "Searching with typos" (Habrahabr).
6. Jeannette "Angel" Devereaux is a character of the cult media franchise Wing Commander, which includes computer space simulators, strategies, other games, literary works and a film. She is mentioned here solely because I, having no grasp of how French surnames sound, long believed the name was pronounced "Deverox": an illustration of a translation error made not "by ear" but "by eye".
7. The function for formatted SQL queries was honestly borrowed from Artem Kamnev and Maxim Mesilov's article "SQL Injections: A Struggle for Pleasure". Unfortunately, their site has not been loading lately, but a copy of the article can still be found: Sql-injections: a struggle for pleasure (unfortunately, without the source code). I removed some controversial features and instead added the ability to pass an array as an argument.