
Finding similar names using MySQL + PHP

The topic in the title of this article is not new. On the Internet you can find many questions about how to implement such a search, but noticeably fewer answers. And those answers often boil down to advice to use third-party products, for example Sphinx. Yet such heavyweight add-ons are frequently unnecessary.

Not so long ago, the following task arose:



That is the "given." The "required" is the following:


By automatically scanning the database, identify and form pairs of names that are similar by some criterion, and pass them on for manual analysis to determine whether they belong to the same person or to different people. Moreover, the search was supposed to be somewhat paranoid: unlike a "Did you mean..." prompt, it had to return a few extra names on the principle of "better safe than sorry." Consequently, a somewhat larger number of pairs would have to be analyzed for similarity. And this had to be done not in real time, but still within a genuinely finite time: not in a day or a week but, in the worst case, in a few hours.

In short: there are almost a hundred and fifty thousand names, nicknames, and pseudonyms, without division into separate fields (last name, first name, middle name, etc.), with arbitrary word order, typos, errors, and variant spellings. Moreover, the names come from a variety of nationalities: Russian, English, Jewish, Polish, French, Korean, Chinese, Japanese, Indian, and so on. Among them, all pairs that strongly resemble the same name must be found.


At first, the task seemed extremely difficult. For me, a person unfamiliar with fuzzy search, it was hard even to imagine formal criteria for judging the similarity or difference of names.

But gradually an understanding of the approaches and algorithms used for this purpose began to form. First of all, these were algorithms specifically targeted at, or simply well proven in, searching for spelling errors in names:


Guided by common sense, the following was adopted:


Phonetic coding is not suitable because of how the database was filled. Phonetic search can find names that were written down incorrectly by ear, but not names mangled by amateur translators. Let me explain with an example:

Jeannette "Angel" Deveru

Suppose Captain Third Rank Jeannette Devereaux [6] comes to our native passport office and introduces herself clearly and loudly, by the book. By ear, the clerk may write her name down as "Deveru," "Deviryu," or even "Divyra," and skip the doubled consonants too. All such "auditory hallucinations" are tracked perfectly well by phonetic fuzzy-search algorithms.

But a pseudo-"translator" who does not know the specifics of French (though familiar with English), while copying data from a military ID, can easily write "Jennette Deverox." The especially gifted may add her call sign, or write the last name before the first name.

And many "translators" from English do not know that the name "Ralph Fiennes" is pronounced "Rafe Fiennes," not "Ralph."

So phonetic coding is not suitable here, because in this case erroneous names are written down not "as they are heard" but "as they are seen."

An attempt to apply phonetic algorithms to names that were perceived by sight rather than by ear can make the phonetic indexes of the names differ from each other even more than their spellings do.

So maybe use edit distance?


It would be a good idea were it not for the fact that these metrics are mostly designed for comparing single words, not phrases with arbitrary word order. Swap the first and last names, or insert a pseudonym between them, and they show extremely unsatisfactory results.

In that case the names have to be broken into separate words and each word compared individually. This significantly complicates (and slows down) the algorithm, even if the individual words are first reduced to a phonetic index. An added complication is that the number of words in a name is not constant. Taking possible omissions and word permutations into account, it is hard even to imagine the laboriousness of this process.

Other methods


The disadvantage of metrics based on extracting common substrings is the high time complexity of computing relevance, combined with unclear indexing possibilities, especially for short names. Even a single extraction of the longest common substring has time complexity O(nm). And since the search for common substrings must then continue recursively in the remainders, computing the relevance function is a very expensive operation, losing even to N-gram comparison.

Again, when words change places, relevance drops sharply.

Computing the Jaro-Winkler distance is significantly faster. But, unfortunately, it is not resistant to word reordering either.

So, to avoid splitting and recombining the words in a name, we settled, willy-nilly, on N-gram analysis. More precisely, on trigrams, because with short Korean names there is nothing to be gained from longer N-grams.

Stage One. The frontal attack.


So, to begin with, in order to get an idea of the time complexity of trigram comparison, I wrote the simplest version of a script that compares names pairwise across the database.

The script was written and tested at home on an eight-year-old computer with a single-core Intel Pentium D 925 on an ASUS P5B-E motherboard and ordinary Seagate hard drives of about the same age.

This script worked, as they say, head-on. It read the database and, for each name, after normalizing it, split it into unique trigrams and counted how many of them occurred in each of the preceding names (that is, the names of people already checked earlier in the database). If twice the number of matches, divided by the sum of the unique trigram counts of the two names, exceeded a certain threshold (for example, 0.75 or 0.8), the pair was written to a file.

The normalization of the name was as follows:


This normalization produced a simplified name with no punctuation marks and no non-transliterable characters, in which each word is framed, and separated from other words, by double spaces.

The use of double spaces between words, as well as at the beginning and end of the name, made the trigram analysis algorithm completely insensitive to word reordering.

From the normalized name, the unique trigrams were extracted: all combinations of three consecutive characters (including spaces).

Then, for two names, the following relevance coefficient was calculated: twice the number of common trigrams, divided by the sum of the unique trigram counts of the two names.
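The normalization and relevance calculation described above can be sketched in a few lines. The article's implementation is in PHP; this is an illustrative Python equivalent, with Latin letters standing in for the Windows-1251 alphabet:

```python
def normalize(name):
    # Keep only letters, uppercase them, and frame/separate the words
    # with double spaces, as described in the article.
    words = "".join(c if c.isalpha() else " " for c in name.upper()).split()
    return "  " + "  ".join(words) + "  "

def trigrams(normalized):
    # All unique 3-character windows, spaces included.
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}

def relevance(name_a, name_b):
    a, b = trigrams(normalize(name_a)), trigrams(normalize(name_b))
    # Twice the number of common trigrams over the sum of unique trigram counts.
    return 2 * len(a & b) / (len(a) + len(b))
```

Thanks to the double-space framing, relevance("John Smith", "Smith John") comes out as exactly 1.0: each word contributes the same trigrams regardless of its position.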

The requirement that the trigrams be unique is stated for a reason. On the one hand, counting the total number of trigrams in names, without regard to uniqueness, would allow us to confidently distinguish names like "John Smith" and "John John Smith," whereas counting only unique trigrams makes these names come out identical.

On the other hand, first, the need to distinguish such names will not arise often. Second, calculating relevance over all trigrams, rather than only unique ones, would significantly complicate the procedure: instead of doubling the number of common trigrams, one would have to sum the occurrences of the first name's trigrams in the second and of the second name's trigrams in the first, roughly doubling the time spent computing relevance. And third, it would rule out an optimization of the relevance calculation that will be discussed further on.
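The difference can be seen in a small sketch (Python for illustration; the occurrence-counting variant is a hedged reading of the summation described above):

```python
from collections import Counter

def trigram_counts(s):
    # Multiset of trigrams: each 3-character window with its occurrence count.
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def relevance_unique(a, b):
    ta, tb = set(trigram_counts(a)), set(trigram_counts(b))
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def relevance_multiset(a, b):
    ta, tb = trigram_counts(a), trigram_counts(b)
    shared = sum((ta & tb).values())  # shared occurrences (minimum counts)
    return 2 * shared / (sum(ta.values()) + sum(tb.values()))

a = "  JOHN  SMITH  "        # normalized "John Smith"
b = "  JOHN  JOHN  SMITH  "  # normalized "John John Smith"
# relevance_unique(a, b) == 1.0, while relevance_multiset(a, b) stays below 1.
```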

So, the frontal algorithm was written and tested on the first few thousand names. The result was depressing. As was to be expected, the time to find matches across the entire array of names depended quadratically on its size. The extrapolated time to process the entire array of 130-140 thousand names was about two years. A bit much!

But there was an initial assessment of the performance of the algorithm.

Stage Two. Search for ways to optimize.


First of all, I had to determine which elements of the algorithm needed optimizing, and which one was the notorious "bottleneck" braking the entire system. There were several areas for optimization:

  1. The initial sample from the database should return not the entire array of names, but only those already at least slightly similar to the one being checked, to reduce the number of subsequent operations. Profiling the execution time of different sections of the code showed that the database queries were the bottleneck. It was necessary to reduce both the execution time of each query and the number of candidate names returned by each sample.
  2. Substring comparison and copying on multibyte UTF-8 strings are several times slower than on single-byte strings.
  3. There is no need to compute the relevance coefficient exactly if the share of common trigrams is small. The process can be interrupted as soon as it becomes clear that even if all the remaining trigrams were common, the required threshold still could not be reached.
  4. Other small things.

Stage Three. Implementation.


First of all, as the simplest step, an attempt was made to switch the normalized names from UTF-8 to Windows-1251. However, it turned out that MySQL is very touchy about storing strings in different encodings in one database, even in different tables, which leads to hard-to-localize errors during database operations. Therefore, the normalized name is still stored in an additional joined table in UTF-8 and, when necessary, recoded into Windows-1251.

Storing the normalized name saves significant time, because normalization then needs to be performed only once per record.

In addition, the number of unique trigrams in the normalized name is stored in the same joined table. This value can be used to cut the relevance calculation short as soon as the names are clearly too different; the number of unique trigrams no longer has to be recomputed each time.

And, most importantly, the processed names are indexed with a trigram index: in the additional table, all unique trigrams present in the normalized name are listed. To avoid storing the trigrams themselves in the "slow" UTF-8 encoding, each was stored as an integer hash: the three low bytes of the number correspond to the numeric values of the characters in the Windows-1251 encoding.
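In Python, that packing might look like this (the byte order within the integer is an assumption; Python's cp1251 codec stands in for the Windows-1251 handling):

```python
def trigram_hash(trigram):
    # Pack a three-character trigram into one integer: one byte per
    # character, each byte being the character's Windows-1251 code.
    b = trigram.encode("cp1251")
    return (b[0] << 16) | (b[1] << 8) | b[2]
```

An integer column can then be indexed and compared far faster than a multibyte UTF-8 string.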

The first approach, as it turned out later, was completely ineffective: for each trigram of the name being checked, a separate sample of candidate names was made, and these samples were then combined into a single array. In addition, each name in this array carried a counter of how many samples it had appeared in. The assumption was that this would speed up the relevance calculation, since the candidate names would not have to be decomposed into trigrams again: the counter alone would suffice.

Unfortunately, the multiple samples from the MySQL database (to a greater extent) and the formation of the combined array of names (to a lesser extent) introduced overhead that exceeded the gain from the simplified relevance calculation many times over.

As a result of the described optimizations, a significant reduction in time costs was achieved:

The use of a trigram index for the initial sample reduced the projected time from two years to less than three months. Using numeric values instead of UTF-8 substrings reduced the overall processing time by another 10-12%. But 2.5 months of uninterrupted work still seemed too long. The bottleneck remained the initial selection of similar names.

Stage Four. Further improvements.


Querying the database once per trigram index was not the best idea. Not only did the numerous samples take a long time, but the combined array was also very large, and processing it took considerable time despite the per-element counter of common trigrams.

I had to abandon the idea of that counter and instead fetch the entire initial set of candidate names with a single query using the SQL WHERE ... IN construct. This reduced the forecast several more times, to one month.
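The single-query idea can be illustrated as follows (a Python sketch using an in-memory SQLite database as a stand-in for MySQL; the table and column names follow the structures shown later in the article, and the sample data is invented):

```python
import sqlite3

# In-memory stand-in for the MySQL tables described in the article.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prs_normal (id INTEGER PRIMARY KEY, name TEXT, num INTEGER)")
db.execute("CREATE TABLE prs_5gramms (code INTEGER, id INTEGER)")
db.executemany("INSERT INTO prs_normal VALUES (?, ?, ?)",
               [(1, "  JOHN  SMITH  ", 13),
                (2, "  JON  SMITH  ", 12),
                (3, "  JANE  DOE  ", 11)])
db.executemany("INSERT INTO prs_5gramms VALUES (?, ?)",
               [(101, 1), (102, 1), (101, 2), (205, 3)])

def candidates(codes):
    # One round trip: every stored name sharing at least one pentagram code,
    # instead of one query per pentagram.
    placeholders = ", ".join("?" * len(codes))
    rows = db.execute(
        "SELECT DISTINCT n.id, n.name, n.num "
        "FROM prs_5gramms AS g JOIN prs_normal AS n ON n.id = g.id "
        "WHERE g.code IN (" + placeholders + ")", list(codes))
    return rows.fetchall()
```

Here candidates([101, 102]) returns the two "Smith" records in a single query.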

Next, the relevance-calculation procedure was changed. From the lower bound on relevance and the numbers of unique trigrams in the names, the maximum allowable number of "misses" was calculated: the number of trigrams in the candidate name that do not occur in the name being checked. Then all the candidate's trigrams were tested for occurrence in the checked name, and as soon as the number of misses exceeded the maximum allowed, the comparison stopped. This reduced the forecast by a few percent more. But it was still very far from practical values.
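A sketch of that early exit (Python; the miss budget floor(m - R*(n+m)/2) mirrors the formula in the article's PHP fragment):

```python
import math

def passes_threshold(checked, candidate, min_relevant=0.75):
    # checked, candidate: sets of unique trigrams of the two names.
    n, m = len(checked), len(candidate)
    # Misses are candidate trigrams absent from the checked name. From
    # relevance = 2*(m - misses)/(n + m) >= R, the allowed budget is:
    miss = math.floor(m - min_relevant * (n + m) / 2)
    if miss < 0:
        return False          # the names can never be similar enough
    for t in candidate:
        if t not in checked:
            miss -= 1
            if miss < 0:
                return False  # abort early: threshold unreachable
    return True
```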

Stage Five. Decision.


The task seemed intractable. On the one hand, the number of names in the initial sample had to be reduced significantly; on the other hand, the reduction must not be too artificial, lest it accidentally cut off names close to the lower limit of the allowable relevance range.

The approach was found simply by looking through the pairs of similar names already found. Practically all such pairs contained common subsequences 6-7 characters long (including the framing double spaces). Simple reasoning (not given here for its lack of rigor) leads to the conclusion that to achieve a relevance above 0.75, normalized names must share a common substring of 5 characters or longer.

Thus, the indexing for the preliminary search moved from trigrams to pentagrams (5-character substrings). To fit a pentagram hash into an integer variable, which is 32 bits wide in the 32-bit version of PHP, each character is represented by 5 bits; conveniently, the alphabet consists of only 31 characters plus the space.
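A sketch of such 5-bit packing (Python; the alphabet here is illustrative, since the original maps 31 Windows-1251 Cyrillic letters plus the space to the codes 0-31):

```python
# Illustrative 32-character alphabet: the space plus 31 symbols, 5 bits each.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ01234"
CHARCODE = {c: i for i, c in enumerate(ALPHABET)}

def pentagram_hash(five_chars):
    # 5 characters x 5 bits = 25 bits: fits a 32-bit signed integer.
    code = 0
    for ch in five_chars:
        code = (code << 5) | CHARCODE[ch]
    return code

def pentagram_hashes(normalized):
    # All unique 5-character windows of a normalized name.
    return {pentagram_hash(normalized[i:i + 5])
            for i in range(len(normalized) - 4)}
```

Since the space has code 0, the framing double spaces cost nothing in the hash, which is what lets the PHP code below start its rolling hash from the third character.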

Another small addition.


As I wrote above, when calculating relevance, the unique trigrams of one name were checked for matches against the trigrams of the other, and the calculation was stopped once the number of "misses" exceeded the maximum permissible.

If the number of unique trigrams in the first name is significantly less than in the second, the calculated maximum allowable number of misses may turn out negative. This means the names cannot possibly be similar at the given relevance threshold, and there is no need to check them at all.

But this cutoff does not work in the opposite case, when the first name has far more unique trigrams than the second. Therefore, bounds on the number of unique trigrams in the normalized name were additionally added to the sampling query.
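Both cutoffs follow from the same bound: even if every trigram of the smaller name matched, relevance could not exceed 2*min(n, m)/(n + m). Requiring that to reach the threshold R gives the length ratio used in the sampling query (a Python sketch):

```python
import math

MIN_RELEVANT = 0.75
# From 2*min(n, m)/(n + m) >= R: the smaller unique-trigram count must be
# at least R/(2 - R) times the larger one.
LEN_RATIO = MIN_RELEVANT / (2 - MIN_RELEVANT)

def candidate_bounds(n):
    # Admissible unique-trigram counts m for a candidate, given n for the
    # checked name; used as "n.num >= nmin AND n.num <= nmax" in the query.
    nmin = math.ceil(n * LEN_RATIO)
    nmax = math.floor(n / LEN_RATIO)
    return nmin, nmax
```

For a name with 13 unique trigrams and R = 0.75, only candidates with 8 to 21 unique trigrams need to be fetched at all.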

This is a somewhat controversial improvement. With a relevance threshold of 0.75, it does not speed things up; on the contrary, it costs about 3% of performance due to the more complex sample. But with a threshold of 0.8, it already yields a 3% gain.

Nevertheless, this technique was kept. The losses are not that big, and one can first run a preliminary search for similar pairs at a high relevance threshold, and only after that preliminary cleanup lower the threshold to 0.75.

Summary.


As a result of switching the initial sample from common trigrams to pentagrams, the script was sped up significantly. Searching for pairs with relevance 0.8 among 138.5 thousand names took about 5 hours instead of a month. That is, a single search for similar names across the entire database for a given name takes (assuming the quadratic dependence of processing time) about 0.26 s. Not negligible, but bear in mind that this test run was made not on a server with a powerful processor and a high-performance disk subsystem, but on a half-dead eight-year-old home computer.

In principle, the initial search could even be done with hexagrams (6-character index substrings). That would give a severalfold speedup. But there was a well-founded fear that many pairs of short Asian names (and not only them) would be lost. Besides, there is little point: the subsequent manual analysis of the resulting pairs (more than 11,000 of them) will in any case take several months.

But hexagram indexing is quite suitable for prompts like "Did you mean...", where excessive paranoia is not required and, on the contrary, the minimum number of the most appropriate names should be found.

In principle, for long names one could use indexes of seven-character substrings and, to fit them into 32-bit variables, apply preliminary phonetic coding to reduce the alphabet to 15 characters plus the space. But no such task arose at the time.

Unfortunately, shortly after the script was written, the site was blocked for copyright infringement. Although it soon moved to another address, the search for duplicate names never got underway because of the problems that piled up.

Algorithm fragments:


DB Table Structures
CREATE TABLE IF NOT EXISTS `prs_persons` (
  `id` int(10) unsigned NOT NULL,
  `name_ru` varchar(255) collate utf8_unicode_ci default NULL, -- the name as entered
  -- ... other fields ...
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE IF NOT EXISTS `prs_normal` (
  `id` int(10) unsigned NOT NULL,
  `name` varchar(255) collate utf8_unicode_ci default NULL, -- the normalized name
  `num` int(2) unsigned NOT NULL, -- number of unique trigrams in the name
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE IF NOT EXISTS `prs_5gramms` (
  `code` int(10) unsigned NOT NULL, -- pentagram hash
  `id` int(10) unsigned NOT NULL, -- id of the name containing it
  KEY `code` (`code`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;


Constants and auxiliary variables, preliminary actions:
$opt_debug_show_sql = 0;
$opt_debug_mode = 0;
$opt_table_prefix = "prs";

define('MIN_RELEVANT', 0.8);
define('SITE_PREFIX', "…");

// Minimal ratio of the unique-trigram counts of two names
// at which the relevance threshold can still be reached.
$len_ratio = MIN_RELEVANT / (2 - MIN_RELEVANT);

// Characters accepted during normalization and their canonical replacements
// (the Cyrillic parts of these literals are omitted here).
$suitablesin = iconv("Windows-1251", "UTF-8", "AaBbCcEegHKkMmnOoPprTuXxYy03468");
$suitablesout = "";

// 5-bit character codes; the keys for codes 1..31 are Cyrillic letters
// (omitted here).
$charcode = array(
    ' ' => 0, '' => 1, '' => 2, '' => 3, '' => 4, '' => 5, '' => 6, '' => 7,
    '' => 8, '' => 9, '' => 10, '' => 11, '' => 12, '' => 13, '' => 14, '' => 15,
    '' => 16, '' => 17, '' => 18, '' => 19, '' => 20, '' => 21, '' => 22, '' => 23,
    '' => 24, '' => 25, '' => 26, '' => 27, '' => 28, '' => 29, '' => 30, '' => 31);

// Unlimited execution time.
set_time_limit(0);

// Error handler.
function message_die($errno, $error, $file, $line) {
    if ($errno) {
        print "<p><b>Error " . $errno . " " . $file . "(" . $line . "):</b> " . $error;
        die();
    }
};

$fout = fopen("…", "w"); // the results file
// ... (the main comparison loop from the next fragment runs here) ...
fclose($fout);


The body of the main comparison cycle:
// Current record of the loop over all persons:
$id = …;      // ID of the person
$name_ru = …; // the person's name in UTF-8

// Normalization of the name.
$rusn = "  "; // framing double space at the start
$ischar = FALSE;
for ($j = 0; $j < mb_strlen($name_ru, "UTF-8"); $j++) {
    $char = mb_substr($name_ru, $j, 1, "UTF-8");
    if (($pos = mb_strpos($suitablesin, $char, 0, "UTF-8")) === FALSE) {
        if ($ischar) { // end of a word: separate it with a double space
            $rusn .= "  ";
            $ischar = FALSE;
        }
    } else { // append the Windows-1251 canonical form of the character
        $rusn .= $suitablesout{$pos};
        $ischar = TRUE;
    }
}
if ($ischar) $rusn .= "  "; // framing double space at the end

if (strlen($rusn) < 5) continue; // too short to index with pentagrams

$norm = iconv("Windows-1251", "UTF-8", $rusn); // UTF-8 copy for storage

// Collect the pentagram hashes of the normalized name.
$subgramms = array();
// The two leading spaces have code 0, so the rolling hash may start
// from the third and fourth characters.
$code = ($charcode[$rusn{2}] << 5) | $charcode[$rusn{3}];
for ($j = 4; $j < strlen($rusn); $j++) {
    $code = (($code << 5) | $charcode[$rusn{$j}]) & 0x1FFFFFF; // 25-bit window
    $subgramms[$code] = $code; // keep unique hashes only
}

// Collect the unique trigrams of the normalized name.
$trigramms = array();
for ($k = 0; $k < strlen($rusn) - 2; $k++)
    $trigramms[$trigramm = substr($rusn, $k, 3)] = $trigramm;
$n = count($trigramms);

// Bounds on the unique-trigram count of a candidate name:
$nmin = ceil($n * $len_ratio);
$nmax = floor($n / $len_ratio);

// Select the previously stored names that share at least one pentagram.
$similars = fquery("SELECT n.id AS id, n.name AS name, n.num AS num
    FROM ^@5gramms AS g, ^@normal AS n
    WHERE g.code IN (^N) AND n.id = g.id AND n.num >= ^N AND n.num <= ^N",
    $subgramms, $nmin, $nmax);

// Store the normalized name and its pentagram index for later records.
fquery("INSERT INTO ^@normal (id, name, num) VALUES (^N, ^S, ^N)", $id, $norm, $n);
foreach ($subgramms as $key=>$code)
    fquery("INSERT INTO ^@5gramms (code, id) VALUES (^N, ^N)", $code, $id);
unset($subgramms);

// Check each candidate name.
for ($i = 0; $i < @mysql_num_rows($similars); $i++) {
    $similar = @mysql_fetch_assoc($similars)
        OR message_die(@mysql_errno(), @mysql_error(), __FILE__, __LINE__);
    $name = iconv("UTF-8", "Windows-1251", $similar['name']);
    $simid = $similar['id'];
    $m = $similar['num']; // unique trigrams in the candidate
    $nm = 0;              // matched trigrams
    $miss = floor($m - MIN_RELEVANT * ($n + $m) / 2); // allowed "misses"
    for ($k = 0; ($k < strlen($name) - 2) && ($miss >= 0); $k++) {
        // Consider only the first occurrence of each trigram (uniqueness):
        if (strpos($name, $trigramm = substr($name, $k, 3), 0) == $k) {
            if (isset($trigramms[$trigramm])) $nm++; // a match
            else $miss--;                            // a miss
        }
    }
    if ($miss >= 0)
        // The relevance threshold is reached: write the pair out.
        fwrite($fout, SITE_PREFIX . $id . "\t" . $rusn . "\t"
            . SITE_PREFIX . $simid . "\t" . $name . "\t"
            . ($nm + $nm) / ($m + $n) . "\r\n");
}
@mysql_free_result($similars)
    OR message_die(@mysql_errno(), @mysql_error(), __FILE__, __LINE__);
unset($trigramms);
// End of the loop over all persons. After it completes:
fclose($fout);


The function of formatted queries to MySQL
[7]
// Syntax: fquery($query_template_text, $arg1, $arg2, ...)
// Special characters in a query template:
//   ^@TableName - the combination ^@ is replaced with the table prefix
//   ^N - numeric parameter(s) (not quoted), comma-separated if an array
//   ^S - string parameter(s) (quoted), comma-separated if an array
//   ^0 - "NULL" or "NOT NULL"
// (c) Grigoryev Andrey aka GrAnd aka Pochemuk
// Thanks to Kamnev Artjom (Kamnium), Mesilov Maxim (Severus) for the idea
// http://life.screenshots.ru
// On success returns a recordset for SELECT, or TRUE for other queries.
// On error returns FALSE.
function fquery() {
    global $opt_debug_mode;
    global $opt_debug_show_sql;
    global $opt_table_prefix; // table prefix from the site's options

    if (is_array(func_get_arg(0))) $args = func_get_arg(0);
    else $args = func_get_args();
    $qtext = $args[0]; // the first argument is always the query template
    if (empty($qtext)) return false; // nothing to do!

    // Replace ^@ with the table prefix.
    $qtext = str_replace("^@", ($opt_table_prefix == "") ? "" : ($opt_table_prefix . '_'), $qtext);

    $i = 0;
    $curArg = 1;
    $query = "";
    while ($i < strlen($qtext)) {
        if ($qtext{$i} == '^') {
            if ($curArg >= count($args)) return false; // more placeholders than parameters!
            $i++;
            switch ($qtext{$i}) {
                case 'N': {
                    if (is_null($args[$curArg])) { $query .= "NULL"; break; }
                    if (is_array($args[$curArg])) $query .= implode(", ", $args[$curArg]);
                    else $query .= $args[$curArg];
                    break;
                }
                case 'S': {
                    if (is_null($args[$curArg])) { $query .= "NULL"; break; }
                    if (is_array($args[$curArg])) $query .= "'" . implode("', '", $args[$curArg]) . "'";
                    else $query .= "'" . $args[$curArg] . "'";
                    break;
                }
                case '0': {
                    if (is_null($args[$curArg])) return false; // nulls are not allowed here!
                    $args[$curArg] = strtoupper($args[$curArg]);
                    if (($args[$curArg] != "NULL") && ($args[$curArg] != "NOT NULL"))
                        return false; // "NULL" or "NOT NULL" only!
                    $query .= $args[$curArg];
                    break;
                }
                default: $query .= $qtext{$i};
            }
            $i++;
            $curArg++;
        }
        else $query .= $qtext{$i++};
    }

    if ($opt_debug_show_sql == 1)
        print('<P><CODE>Query string: "' . $query . '"</CODE></P>' . "\r\n");

    $ResultData = mysql_query($query);
    if (mysql_errno() <> 0) {
        if ($opt_debug_mode == 1) {
            print('<P><CODE>MySQL error: #' . mysql_errno() . ': ' . mysql_error() . ' ');
            print('Query string: ' . $query . '</CODE></P>');
        }
    }
    return($ResultData);
} // End of fquery


The result of the test run on the first thousand records:


Work result
ADDRESS-1 | NAME-1 | ADDRESS-2 | NAME-2 | RELEVANCE
/people/784 | PETER JASON | /people/389 | PETER JACKSON | 0.8125
/people/1216 | CHARLES DENS | /people/664 | CHARLES DENNIS | 0.8
/people/1662 | STEWART F WILSON | /people/1251 | STEWART WILSON | 0.914285714286
/people/1798 | MICHAEL MANN | /people/583 | MICHAEL LEHMANN | 0.846153846154
/people/2062 | MICHAEL BIRN | /people/265 | MICHAEL BIN | 0.8
/people/2557 | GINA DAVIS | /people/963 | GIN DAVIS | 0.8
/people/3093 | DJ JONSON | /people/911 | DON JOHNSON | 0.818181818182
/people/3262 | TOM EVERETT | /people/586 | TOM EVERETT SCOTT | 0.848484848485
/people/3329 | ROBERT RITTI | /people/3099 | ROBERT BITTI | 0.827586206897
/people/3585 | TRACY KEY WOLF | /people/2810 | TRAISY WOLF | 0.857142857143
/people/3598 | ALEXANDER KALUGIN | /people/2852 | ALEXANDER KALYAGIN | 0.85
/people/3966 | SERGEY BODROV | /people/2991 | SERGEY BODROV ML | 0.888888888889
/people/3994 | SERGEI BOBROV | /people/3966 | SERGEY BODROV | 0.8125
/people/4049 | RICHARD LEWIS | /people/2063 | RICHARD J LEWIS | 0.882352941176
/people/4293 | JERRY LIVLEY | /people/2006 | JERRY LEE | 0.88
/people/4377 | JOAN KYUSAK | /people/3774 | JOHN KYUSAK | 0.827586206897
/people/4396 | DIN MCDERMOTT | /people/2614 | DILAN MCDERMOTT | 0.833333333333
/people/4608 | SHAUN JOHNSTON | /people/2036 | DJ J JOHNSTON | 0.8
/people/4981 | CHRISTOPHER MEY | /people/3233 | CHRISTOPHER MURRAY | 0.8
/people/5019 | JANE ALEXANDER | /people/381 | JASON ALEXANDER | 0.842105263158
/people/5551 | CARLOS ANDRES HOMEZ | /people/1311 | CARLOS GOMEZ | 0.810810810811
/people/5781 | ALEX NEUBERGER | /people/4288 | ALEX NUBERGER | 0.8
/people/5839 | JOY TRAVOLTA | /people/935 | JOHN TRAVOLTA | 0.8125
/people/5917 | JOE JOHNSTON | /people/2036 | DJ J JOHNSTON | 0.833333333333
/people/5917 | JOE JOHNSTON | /people/4608 | SHAUN JOHNSTON | 0.8
/people/6112 | THOMAS RYAN | /people/4869 | THOMAS JAY RYAN | 0.823529411765
/people/6416 | BRIAN GEORGE | /people/3942 | GEORGE BRIANTH | 0.848484848485
/people/6520 | JOHN CARNEY | /people/5207 | JOHN KANI | 0.8
/people/6834 | JOHN JOHN ANDERSON | /people/5049 | JOE ANDERSON | 0.838709677419
/people/6836 | MICHAEL ESTON | /people/5056 | MICHAEL WESTON | 0.827586206897
/people/6837 | DAVID BARONE | /people/5884 | DAVID BARON | 0.827586206897
/people/7261 | BILLY GRAY | /people/1695 | BILLY RAY | 0.8
/people/7361 | ALAN DAVID | /people/3087 | DAVID ALAN BASH | 0.838709677419
/people/7447 | DAVID AEYER | /people/2277 | TAYER DAVID | 0.814814814815
/people/7497 | ALEXANDER KARAMNOV | /people/3857 | ALEXANDER KARPOV | 0.8
/people/7499 | NIKOLAS LAMLI | /people/4424 | NIKOLAS LEE | 0.827586206897
/people/7534 | RICHARD RICHL | /people/3547 | RICHARD NICHL | 0.857142857143
/people/7547 | SVETLANA STARIKOVA | /people/1985 | SVETLANA SVETIKOVA | 0.8
/people/7677 | JAIS ALEXANDER | /people/381 | JASON ALEXANDER | 0.842105263158
/people/7677 | JAIS ALEXANDER | /people/5019 | JANE ALEXANDER | 0.833333333333
/people/8000 | GREGORY SMITH | /people/7628 | GREGORY P SMITH | 0.909090909091
/people/8137 | KASPER KRISTENSEN | /people/128 | JESPER KRISTENSEN | 0.8
/people/8186 | SHANE COG | /people/6235 | KAIN KOSUGI | 0.814814814815
/people/8219 | BRANDON JAMES OLSON | /people/797 | JAMES OLSON | 0.810810810811
/people/8442 | GUNNAR JOLFSSON | /people/7033 | GUNNAR JOHNSON | 0.8
/people/8458 | JOHN ALEXANDER | /people/381 | JASON ALEXANDER | 0.810810810811
/people/8458 | JOHN ALEXANDER | /people/5019 | JANE ALEXANDER | 0.8
/people/8614 | DAVID HERMAN | /people/4945 | DAVID HEYMAN | 0.8
/people/8874 | NIKOLAS ROUG | /people/1667 | NIKOLAS ROU | 0.827586206897
/people/8987 | DAVID ROSS | /people/4870 | DAVID CROSS | 0.814814814815
/people/9132 | ROBERT LONG | /people/7683 | ROBERT LONGO | 0.827586206897
/people/9202 | ROBERT MENDEL | /people/3410 | ROBERT MANDEL | 0.8125
/people/9229 | ASHLEY LAWRENCE | /people/2534 | LAWRENCE ASHLEY | 1
/people/9303 | JOHN EYLARD | /people/8703 | JOHN EYLUARD | 0.827586206897
/people/9308 | SHAUN ROBERTS | /people/6552 | SEAN O ROBERTS | 0.903225806452
/people/9347 | STEPHEN SERGE | /people/2911 | STEFEN SURZHIK | 0.8
/people/9432 | POLLY SHANNON | /people/2240 | MOLLY SHANNON | 0.8
/people/9583 | JULIE HARRIS | /people/904 | JULIUS HARRIS | 0.838709677419
/people/9788 | ANTHONY STARR | /people/8308 | ANTHONY STARK | 0.8
/people/9835 | MICHAEL WATKINS | /people/4727 | MICHAEL V WATKINS | 0.864864864865
/people/9893 | STEVE MARTINO | /people/6457 | STEVE MARTIN | 0.827586206897


Links and notes:


1. Phonetic coding.
→ Phonetic algorithms. Habrahabr
→ “What is your last name”, or Russian MetaPhone (Russian Metaphone description)

2. Levenshtein distance. Wikipedia

3. The calculation of word similarity based on extracting common substrings is implemented in PHP in the similar_text function. The PHP documentation states that the function is based on Oliver's algorithm [1993]. However, the original algorithm, entitled "Ratcliff/Obershelp Pattern Recognition Algorithm," was published by John W. Ratcliff on the Dr. Dobb's site ( Pattern Matching: the Gestalt Approach ) in 1988; Ian Oliver merely used it in his book Programming Classics: Implementing the World's Best Algorithms.

4. The sources of the Jaro-Winkler algorithm can be viewed, for example, here: Jaro-Winkler Distance

5. Trigram index, or "Search with typos". Habrahabr

6. Jeannette "Angel" Devereaux is a character in the cult media franchise " Wing Commander ", which includes computer space simulators, strategy games, other game genres, literary works, and a film. She is mentioned here solely because I, not understanding how French surnames sound, long believed the name was pronounced "Deverox": an illustration of an erroneous translation made not "by ear" but "by eye."

7. The formatted-SQL-query function was honestly borrowed from Artem Kamnev and Maxim Mesilov, from the article "SQL Injections: A Struggle for Pleasure". Unfortunately, their site no longer loads, but a copy of the article can still be found: Sql-injections: a struggle for pleasure (unfortunately, without source code). Some questionable features were removed; the ability to pass an array as an argument was added instead.

Source: https://habr.com/ru/post/327722/

