
Finding approximate matches and input errors

Foreword



Our company runs its own CRM, and records about organizations, each with an exact address, are periodically added to it. The key point is that these addresses are effectively unique: the system should never hold several organizations at the same address (a quirk of our business, but one that is enforced by people, with all the human factor that implies). KLADR was recently bolted onto the system, but it was no panacea: KLADR contains many inaccuracies, and many settlements are listed without house numbers and so on, even though those addresses exist in reality (the data comes from company employees and is reliable). In the end, address input stayed free-form, with suggestions from KLADR. I should say up front that we decided against splitting the address into separate fields, since the sheer variety of abbreviations promised nothing good; besides, addresses such as “Ololoshskoe sh. 5km”, “Shopping center Veselchak U” or even “Central Market” were perfectly acceptable. And finally, the programmer's main enemy is the human factor, which here means poor spelling and typos of every kind, sticky keyboards included. The rest is under the cut...


What we have


On the one hand we have a database filled with addresses stored in a single field; on the other, an employee in a hurry to enter a report into the system, with all that entails. I wanted to protect the data from duplicates as far as possible, so we decided to show the user a warning about a possible duplicate record.

Solution to the problem



I tried to comment the algorithm as thoroughly as I could, so I will keep the description brief.



How it works



From here on, just the code:

function clearAddr($addr)
{
    // Fragments stripped from an address before comparison: spaces,
    // punctuation and ordinal prefixes. Extend the map with the street-type
    // abbreviations used in your data.
    $associate = array(
        " "  => "",
        "."  => "",
        "-." => "",
        ".-" => "",
        "-"  => "",
        "1-" => "", "2-" => "", "3-" => "",
        "4-" => "", "5-" => "", "6-" => "",
        "7-" => "", "8-" => "", "9-" => "",
    );
    return strtolower(strtr($addr, $associate));
}

function getNums($search)
{
    // Collect every run of digits (house, building and flat numbers)
    preg_match_all("/[0-9]+/", $search, $matches);
    return $matches[0];
}

function getMatchAdress($addr_string, &$Addr_array)
{
    if (!isset($addr_string) || strlen($addr_string) < 1) {
        return false;
    }

    $list = array();
    $nums = getNums($addr_string); // numbers that appear in the query
    // Strip the digits from the query and normalize what is left
    $addr_string = clearAddr(preg_replace("/[0-9]+/", "", $addr_string));
    // Split the query into 2-character chunks
    $word_parts = explode("\n", chunk_split(trim($addr_string), 2));
    array_pop($word_parts); // drop the empty element after the last separator

    foreach ($Addr_array as $row) {
        $word_match = 0;
        $last_pos = 0; // where the previous chunk match ended
        $clr_row = clearAddr($row);
        $row_nums = getNums($row); // numbers that appear in the stored address

        foreach ($word_parts as $syllable) {
            $syllable = trim($syllable); // chunk_split leaves a trailing "\r"
            $match_in = strpos($clr_row, $syllable, $last_pos);
            // Count the chunk only if it starts within 3 characters
            // of the end of the previous match
            if ($match_in !== false && $match_in < $last_pos + 4) {
                $last_pos = $match_in + strlen($syllable);
                $word_match++;
            }
        }

        $all_percents = count($word_parts); // total number of chunks
        $found_percents = $word_match;      // chunks found in this row
        $match_perc = round($found_percents * 100 / $all_percents);
        $max_point = 70; // minimum match percentage

        if ($match_perc >= $max_point) {
            if (!empty($nums)) {
                // The query has numbers: keep the row if it shares at least one
                foreach ($nums as $num) {
                    if (in_array($num, $row_nums)) {
                        $list[] = $row;
                        break; // do not add the same row twice
                    }
                }
            } else {
                // No numbers in the query: the text match alone is enough
                $list[] = $row;
            }
        }
    }

    return $list;
}
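To make the scoring step easier to follow on its own, here is a minimal sketch of it on Latin-script strings. The helper name `chunkMatchPercent` and the sample strings are assumptions made for illustration and are not part of the listing above; the sketch uses `str_split` instead of `chunk_split`, the window of 4 mirrors the `$last_pos + 4` check, and the caller would compare the result with the threshold of 70.

```php
<?php
// Illustrative sketch (hypothetical helper, not from the original listing):
// split the query into 2-byte chunks and count how many of them occur in the
// candidate, in order, each starting within 3 bytes of where the previous
// matched chunk ended.
function chunkMatchPercent(string $query, string $candidate): int
{
    $query = strtolower(trim($query));
    $candidate = strtolower($candidate);
    if ($query === '') {
        return 0; // nothing to compare
    }

    $chunks = str_split($query, 2);
    $matched = 0;
    $lastPos = 0;

    foreach ($chunks as $chunk) {
        $pos = strpos($candidate, $chunk, $lastPos);
        if ($pos !== false && $pos < $lastPos + 4) {
            $lastPos = $pos + strlen($chunk);
            $matched++;
        }
    }

    return (int) round($matched * 100 / count($chunks));
}

// Two dropped letters: every chunk is still found close to the previous one.
echo chunkMatchPercent('doryninsiy', 'dobryninskiy'), "\n";   // 100
// One substituted letter: one of six chunks is lost.
echo chunkMatchPercent('dobrininskiy', 'dobryninskiy'), "\n"; // 83
// An unrelated string shares no chunks.
echo chunkMatchPercent('centralmarket', 'dobryninskiy'), "\n"; // 0
```

The small window is what keeps the check tolerant of a missing or extra letter while still requiring the chunks to appear in roughly the same order as in the stored address.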


What it looks like



What is in the database:

. , .40
. , .14
4- ., .1
. , .15
., .6
. ., .48
...
...
...
. , .31/22
. , .23, . 41
. -, .11
- , .39, .1
4- ., .4
, .2


The query we pass in: Doryninsiy

What we get back:

 Array ( [0] => 4-  ., .1 [1] => 4-  ., .4 ) 


If the search query had been "Doryninsy d.1", we would have seen only the element at key 0.
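The number check behind this behaviour can be sketched separately. The helper names `extractNumbers` and `filterByNumbers` and the transliterated sample rows are assumptions made for illustration; the logic follows the listing: a candidate is kept if the query contains no numbers, or if it shares at least one number with the query.

```php
<?php
// Illustrative sketch (hypothetical helpers, not from the original listing).

// Return every run of digits found in the string.
function extractNumbers(string $text): array
{
    preg_match_all('/[0-9]+/', $text, $matches);
    return $matches[0];
}

// Keep only the candidates that share at least one number with the query.
// A query without numbers keeps every candidate.
function filterByNumbers(string $query, array $candidates): array
{
    $queryNums = extractNumbers($query);
    if (empty($queryNums)) {
        return $candidates;
    }

    $kept = [];
    foreach ($candidates as $row) {
        // A non-empty intersection means at least one number is shared
        if (array_intersect($queryNums, extractNumbers($row))) {
            $kept[] = $row;
        }
    }
    return $kept;
}

// Assumed sample rows, transliterated for readability
$rows = ['4-y Dobryninskiy per., d.1', '4-y Dobryninskiy per., d.4'];
print_r(filterByNumbers('Doryninsy d.1', $rows)); // only the first row is kept
```

Note that any shared number counts: a query containing 4 would keep both sample rows, because the ordinal in the street name is extracted as a number too.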

PROFIT! Thanks for your attention.

Source: https://habr.com/ru/post/140943/

