
Phonetic search

A couple of years ago I had to write a search for a site that would recognize typos and suggest corrected queries. Several options were tried, and I want to describe one of them here. Search based on how words sound can erase language boundaries, since proper names sound alike in different languages. For example, you search for “Arnold Schwarzenegger” typed in Russian and find “Arnold Schwarzenegger” in English; you search for “Michael Jordan” and find “Michael Jordan”; you search for “Chuck Norris”, and suddenly he finds you. Besides matching similar-sounding words, this method also absorbs a large number of typos. Anyway, enough of the fluff, time for some internals ...



To understand how this search works, you need some idea of Soundex. I'll say right away that the solution proposed below is NOT based on Soundex; it uses the Daitch-Mokotoff table, which I Russified and modified to make things more interesting. Soundex is an old and well-known thing, so readers already familiar with the algorithm can skip the next section. For everyone else, here is a short introduction to what it is actually about ...

Intro / Soundex


Soundex is used by the United States National Archives, which store genealogical records about citizens. One of the requirements handed to the developers was that the algorithm must find a name regardless of how it is spelled, since many proper names are recorded in several ways (for example, Smith / Smyth). Citizen Russell scratched his American head and came up with this:
  1. Each word is represented by a code of 4 characters.
  2. The first character of the four-character code is the first letter of the word being encoded.
  3. Each following letter of the word is replaced by a digit according to the table.
  4. Characters that are not in the table are thrown away and not encoded.
Here is the table:



Obviously, only consonants are encoded, since they carry the phonetic backbone of an English word. Double consonants are encoded as one. The resulting code is truncated or padded with zeros to 4 characters. For example, Washington is encoded as W252 ("W" is kept as the first letter, "a" is dropped, "s" = 2, "h" and "i" are dropped, "n" = 5, "g" = 2, and the remaining characters are cut off), and Lee is encoded as L000 ("L" is kept as the first letter, both "e"s are dropped, and 000 pads the code to four characters). The first letter of a word always stays as it is, even if it is a vowel and even if it is not in the table. So, knowing the Soundex code, American grannies can quickly dig all the Smiths, Smyths and Smooths out of the filing cabinets. Read more about Soundex. All the same, Soundex is pretty weak and doesn't cut it, because Jones, James and Jeans all get the same code.
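The rules above fit in a few lines. Below is a minimal Python sketch of the simplified variant described here, not a reference implementation. The letter groups are the commonly published American Soundex ones, filled in as an assumption, and the names `CODES` and `soundex` are made up for illustration. The sketch does not implement the extra H/W rule of the official archive version, so some names may be coded differently there.

```python
# Letter groups of the commonly published American Soundex table (assumption).
GROUPS = (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
)
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}


def soundex(word: str) -> str:
    """Simplified Soundex: first letter + digits, padded or cut to 4 characters."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Code of the previous letter; None for letters that are not in the table.
    previous = CODES.get(first)
    for ch in letters[1:]:
        code = CODES.get(ch)
        # Letters outside the table are dropped; repeated codes are written once.
        if code is not None and code != previous:
            digits.append(code)
        previous = code
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ("Washington", "Lee", "Smith", "Smyth", "Jones", "James", "Jeans"):
        print(name, soundex(name))
```

Running it on Jones, James and Jeans shows the weakness mentioned above: all three collapse into one code.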

Daitch-Mokotoff


Who the hell knows, maybe Soundex is enough for English phonetics, but its rules do not suit other languages. For our great and mighty Russian, for example, encoding consonants alone is not enough. A dude named Daitch and the taxpayer Mokotoff worked out their own table one evening, taking into account the pronunciation quirks typical of European languages. Here is the thing:



The coding principle is the same as in soundex, but with additions:

  1. Words are encoded with 6 digits, where each digit denotes one of the sounds from the left column of the table.
  2. When a word has too few letters, the code is padded with zeros to 6 digits; when it has too many, the code is truncated to 6 digits. In the word GOLDEN only four sounds [GLDN] are encoded, giving 583600.
  3. The letters A, E, I, O, U, J, and Y are always replaced by a digit when they come first in a word, as in the name Alpert 087930. Elsewhere these letters are skipped, except when two of them in a row form a pair and the pair is immediately followed by another vowel. For example, the pair 'eu' is encoded in the name Breuer (791900), but not in the name Freud.
  4. The letter H is replaced by a digit if it comes first, as in Haber 579000, or if a vowel immediately follows it, as in Manheim 665600; in all other cases it is skipped.
  5. When adjacent letters form a longer sequence listed in the table, the longest matching sequence must be encoded. Mintz is encoded as MIN-TZ 664000, not as MIN-T-Z.
  6. When adjacent letters produce two identical codes in a row, they are written as one; for example, TOPF becomes TO-PF 370000, not TO-P-F 377000. The exception is the combinations MN and NM, whose letters are always encoded separately and never merged, as in Kleinman 586660, not 586600.
  7. The sequences CH, CK, C, J, and RS may sound different in different languages, so two variants are offered for them (in the table, the Russian variant is highlighted in red).
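
To make rules 1-6 concrete, here is a rough Python sketch. It is a toy built on stated assumptions, not the PHP code from the article: `TABLE` holds only a small hand-picked subset of the Daitch-Mokotoff table (just enough for the names quoted above), letters missing from it are silently skipped, the two-variant sequences from rule 7 are left out, and the names `dm_code`, `TABLE` and `_same` are hypothetical.

```python
VOWELS = set("AEIOU")


def _same(code):
    """Helper: the same code at the start, before a vowel, and elsewhere."""
    return (code, code, code)


# sequence -> (code at word start, code before a vowel, code elsewhere).
# "" means "not coded". Partial subset only (assumption, see lead-in).
TABLE = {
    "A": ("0", "", ""), "E": ("0", "", ""), "I": ("0", "", ""),
    "O": ("0", "", ""), "U": ("0", "", ""),
    "EI": ("0", "1", ""), "EU": ("1", "1", ""),          # rule 3: vowel pairs
    "H": ("5", "5", ""),                                  # rule 4
    "B": _same("7"), "P": _same("7"), "F": _same("7"), "PF": _same("7"),
    "D": _same("3"), "T": _same("3"),
    "G": _same("5"), "K": _same("5"),
    "L": _same("8"), "M": _same("6"), "N": _same("6"), "R": _same("9"),
    "S": _same("4"), "Z": _same("4"), "TZ": _same("4"),
    "MN": _same("66"), "NM": _same("66"),                 # rule 6 exception
}
MAX_SEQ = max(len(seq) for seq in TABLE)


def dm_code(word: str) -> str:
    """Encode one Latin-script word with the partial table above."""
    word = word.upper()
    codes = []
    last = ""  # code produced by the previous sequence ("" if not coded)
    i = 0
    while i < len(word):
        # Rule 5: try the longest sequence first.
        for size in range(MAX_SEQ, 0, -1):
            seq = word[i:i + size]
            if seq in TABLE:
                break
        else:
            # Not in this partial table: skip it (simplification).
            last = ""
            i += 1
            continue
        at_start, before_vowel, elsewhere = TABLE[seq]
        following = word[i + len(seq):i + len(seq) + 1]
        if i == 0:
            code = at_start
        elif following in VOWELS:
            code = before_vowel
        else:
            code = elsewhere
        # Rule 6: identical codes in a row are written once.
        if code and code != last:
            codes.append(code)
        last = code
        i += len(seq)
    # Rules 1-2: pad with zeros or truncate to 6 digits.
    return ("".join(codes) + "000000")[:6]


if __name__ == "__main__":
    for name in ("Golden", "Alpert", "Breuer", "Haber", "Manheim",
                 "Mintz", "Topf", "Kleinman"):
        print(name, dm_code(name))
```

With this subset the names from the list come out as quoted, for example 583600 for Golden and 586660 for Kleinman. A real implementation needs the full table and has to produce both variants for the ambiguous sequences.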

The table is designed for Latin-script languages, so to encode Russian you first have to run it through the magic of transliteration.
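
The transliteration step might look roughly like the Python sketch below. The mapping is just one of many possible Russian-to-Latin schemes and is purely illustrative; the Russified table used in the article is not reproduced here, so both `TRANSLIT` and `transliterate` are assumptions.

```python
# One possible Cyrillic-to-Latin mapping (illustrative assumption).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Lowercase the text and replace Cyrillic letters; keep everything else."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    print(transliterate("Чак Норрис"))
```

The output of such a function is what would then be fed to the phonetic encoder, so the choice of mapping directly affects which Latin sequences the table gets to see.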

Outro


In total there are two functions, dmword and dmstring: the first encodes a single word as a Daitch-Mokotoff code, the second splits a string into words, encodes each word, and stitches the results into a string of Daitch-Mokotoff codes. This particular implementation is written in PHP, but it can be rewritten in anything. The resulting codes are stored in the database and queried as usual. It works with UTF-8.
See both functions in the source.
dmstring(' ') == dmstring('Michael Jordan') == 658000 493600
dmstring(' ') == dmstring('Arnold Schwarzenegger') == 096830 479465
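
The split-encode-join shape of dmstring can be sketched in Python as follows. This is not the author's PHP: the word encoder is passed in as a parameter, and `placeholder_dmword` is a deliberately fake stand-in (it only encodes the word length) so that the block runs on its own.

```python
import re
from typing import Callable


def dmstring(text: str, dmword: Callable[[str], str]) -> str:
    """Split text into words, encode each word, join the codes with spaces."""
    # Runs of letters in any script; digits and punctuation act as separators.
    words = re.findall(r"[^\W\d_]+", text)
    return " ".join(dmword(word) for word in words)


def placeholder_dmword(word: str) -> str:
    """Fake encoder for the demo: word length padded to 6 digits.
    A real dmword would return the Daitch-Mokotoff code instead."""
    return str(len(word)).ljust(6, "0")[:6]


if __name__ == "__main__":
    print(dmstring("Michael Jordan", placeholder_dmword))
```

Keeping the encoder as a parameter makes it easy to swap in a transliterating Daitch-Mokotoff encoder later without touching the splitting logic.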


I'll say straight away that in some particularly nasty cases this search can get things wrong, but it can be tuned for most languages and for a specific use. It should be used when there are no exact matches, as a fallback. The version presented here is tuned to support Russian-English transcoding as well as possible. You can use it like this:
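
A minimal Python sketch of the fallback idea, using sqlite3 from the standard library. It is not the article's PHP code: the table layout, the function names `build_index` and `search`, and the `toy_code` encoder (which only drops vowels and repeated letters, and is NOT Daitch-Mokotoff) are all assumptions made so that the example runs on its own.

```python
import sqlite3
from typing import Callable, List


def toy_code(text: str) -> str:
    """Stand-in phonetic key: consonants only, repeats collapsed (assumption)."""
    out = []
    for ch in text.upper():
        if ch.isalpha() and ch not in "AEIOUY" and (not out or out[-1] != ch):
            out.append(ch)
    return "".join(out)


def build_index(names: List[str], encode: Callable[[str], str]) -> sqlite3.Connection:
    """Store every name together with its precomputed phonetic code."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, code TEXT)")
    conn.execute("CREATE INDEX idx_people_code ON people (code)")
    conn.executemany(
        "INSERT INTO people (name, code) VALUES (?, ?)",
        [(name, encode(name)) for name in names],
    )
    return conn


def search(conn: sqlite3.Connection, query: str,
           encode: Callable[[str], str]) -> List[str]:
    """Exact match first; phonetic match only when nothing exact is found."""
    rows = conn.execute(
        "SELECT name FROM people WHERE name = ? ORDER BY name", (query,)
    ).fetchall()
    if not rows:
        rows = conn.execute(
            "SELECT name FROM people WHERE code = ? ORDER BY name",
            (encode(query),),
        ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index(["Smith", "Jones"], toy_code)
    print(search(conn, "Smith", toy_code))  # exact hit
    print(search(conn, "Smyth", toy_code))  # found through the fallback
```

Precomputing the code at insert time keeps the fallback query a plain indexed equality lookup, which is what makes the "store and query as usual" approach cheap.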
PR: I love Picamatic

Source: https://habr.com/ru/post/28752/

