⬆️ ⬇️

Phonetic algorithms

Phonetic algorithms match two words with similar pronunciation to the same codes, which allows comparison and indexing of a set of such words based on their phonetic similarity.



It is often quite difficult to find an atypical last name in the database, for example:

- Lech, look in our database for Adolph Schvardsenegger ,

- Shvortsegir ? There's no such thing!
In this case, the use of phonetic algorithms (especially in combination with fuzzy matching algorithms) can significantly simplify the task.



It is very convenient to use such algorithms when searching databases in lists of people, in spelling checkers. They are often used in conjunction with fuzzy search algorithms (which undoubtedly deserve a separate article), providing users with a convenient search by first and last names in various databases, employee lists, and so on.

')

In this article I will discuss the most well-known algorithms, such as Soundex , Daitch-Mokotoff Soundex , NYSIIS , Metaphone , Double Metaphone , Russian Metaphone , Caverphone .



Soundex



One of the first was the Soundex algorithm, invented by Robert Russell in the 10-x years of the last century. This algorithm (or rather, its American version) compares the words with a numerical index of the form A126 . The principle of its operation is based on the division of consonant letters into groups with sequence numbers, from which then the resulting value is compiled. Later, a number of improvements were also proposed.



image



The first letter is saved, the subsequent letters are compared to the numbers in the table. Characters not represented in the table (and these are all vowels and some consonants) are ignored. Adjacent characters, or characters separated by the letters H or W , belonging to the same group, are written as one. The result is truncated to 4 characters. Missing positions are filled with zeros. It is easy to see that after all these procedures, only 7,000 different variations of such a code remain, which entails a multitude of completely dissimilar words that have the same Soundex code. Thus, the result in most cases includes a large number of β€œfalse positive” values.



image



In the improved version, as you can see, the letters are divided into more groups. In addition, no special attention is paid to the letters H and W, they are simply ignored. In addition, no operations with the result length are performed β€” the code does not have a fixed length and is not truncated.



Examples


Original Soundex:

D341 β†’ Dedlovsky, Dedlovsky, Didilev, Ditelev, Dudalev, Dudolev, Dutlov, Dydalev, Dyatlov, Dyatlovich.

N251 β†’ Nagimov, Nagmbetov, Nazimov, Nasimov, Nassonov, Nezhnov, Neznaev, Nesmeev, Nizhnevskiy, Nikonov, Nikonovich, Nisenblat, Nisenbaum, Nissenbaum, Noginov, Nochnov.



Superior Soundex:

N8030802 β†’ Nasimov, Nassonov, Nikonov.

N80308108 β†’ Nisenbaum, Nissenbaum.

N8040802 β†’ Nagimov, Surge, Neganov, Noginov.

N804810602 β†’ Nagmbetov.

N8050802 β†’ Nazimov, Nezhnov, Sheath.



On average, there are 21 last names per one Soundex code value. In the case of an improved version of Soundex, only 2-3 names are converted to the same code.



NYSIIS



Developed in 1970 as part of the New York State Identification and Intelligence System, this algorithm produces slightly better results relative to the original Soundex, using more complex rules for transforming the source word into the resulting code. This algorithm is designed to work with American surnames.



NYSIIS Code Calculation Algorithm
  1. Convert the beginning of the word according to the following rules:

    MAC β†’ MCC

    KN β†’ N

    K β†’ C

    PH, PF β†’ FF

    SCH β†’ SSS
  2. Convert the end of a word according to the following rules:

    EE β†’ Y

    IE β†’ Y

    DT, RT, RD, NT, ND β†’ D
  3. Then all letters, except the first, are converted according to the following rules:

    EV β†’ AF

    A, E, I, O, U β†’ A

    Q β†’ G

    Z β†’ S

    M β†’ N

    KN β†’ N

    K β†’ C

    SCH β†’ SSS

    PH β†’ FF

    After vowels: remove H, convert W β†’ A
  4. Remove S at the end
  5. Convert AY at the end β†’ Y
  6. Delete A at the end
  7. Trim to 6 characters (optional).


Examples


CASPARAVAS β†’ Kasparavichus, Kasperovich, Kaspirovich.

CATNACAV β†’ Katnikov, Cytnik, Tsotnikov.

LANSANC β†’ Lenchenko, Leonchenko, Linchenko, Lunchenko, Lyamzenko.

PRADSC β†’ Parish, Prohodsky, Prudsky, Prudsky, Prudsky.

STADNACAV β†’ Stadnikov.



NYSIIS converts a little more than two last names to the same code.



Daitch-Mokotoff Soundex



This algorithm was developed in 1985 by two genealogists, Harry Motoff and Randy Deutsch, in an effort to achieve the best, relatively original Soundex, results when working with Eastern European (including Russian) surnames.

This algorithm has little to do with the original Soundex, except that the result is still a sequence of numbers, but now the first letter is also encoded.



It has much more complex conversion rules - now not only single characters but also sequences of several characters are involved in the formation of the resulting code. In addition, the result of the form 023689 provides about 600 thousand different variations of the code, which, together with complicated rules, reduces the number of β€œredundant”, i.e. "False positive" words in the resulting set.



Conversions are carried out according to the following table (the order of transformations corresponds to the order of the letter combinations in the table):

Source lettersAt the beginningBehind the vowelRest
AI, AJ, AY, EI, EY, EJ, OI, OJ, OY, UI, UJ, UY0one
AU07
IA, IE, IO, IUone
EUoneone
A, UE, E, I, O, U, Y0
Joneoneone
SCHTSCH, SCHTSH, SCHTCH, SHTCH, SHCH, SHTSH, STCH, STSCH, STRZ, STRS, STSH, SZCZ, SZCS2fourfour
SHT, SCHT, SCHD, ST, SZT, SHD, SZD, SD24343
CSZ, CZS, CS, CZ, DRZ, DRS, DSH, DS, DZH, DZS, DZ, TRZ, TRS, TRCH, TSH, TTSZ, TTZ, TZS, TSZ, SZ, TTCH, TCH, TTSCH, ZSCH, ZHSH, SCH, SH, TTS, TC, TS, TZ, ZH, ZSfourfourfour
SC2fourfour
DT, D, TH, T333
CHS, KS, Xfive5454
S, Zfourfourfour
CH, CK, C, G, KH, K, Qfivefivefive
MN, NM6666
M, N666
FB, B, PH, PF, F, P, V, W777
Hfivefive
Leighteighteight
R999
"Alternative" variants of letter combinations (used to generate several alternative codes from the source word):

CH β†’ TCH

CK β†’ TSK

C β†’ TZ

J β†’ DZH



Examples


095747 β†’ Arhiptsev , Arhiptsov , Arkhipychev, Artsybasov, Artsybashev, Archibasov

095757 β†’ Arkhipkov, Arhiptsev , Arhiptsov , Arhipychev

584360 β†’ Galstyan, Galustyan, Gilstein, Glistin, Gluzdan, Golstein, Goldshtein, Goldstein, Kalustian, Khlistun, Khlystun, Khlyustin.



To the same code, this algorithm converts an average of 5 surnames.



Subsequently, Alexander Bader and Stephen Morse developed the Beider-Morse Name Matching Algorithm , aimed at reducing the number of "false positive" values ​​relative to the Daitch-Mokotoff Soundex when working with Jewish (Ashkenazi) surnames.



Metaphone



The Metaphone algorithm (1990) has somewhat better characteristics, which differs from the previous algorithms in a slightly different approach to the coding process: it transforms the source word taking into account the rules of the English language, using noticeably more complex rules, and much less information is lost, since the letters do not split into groups. The resulting code is a set of characters from the set 0BFHJKLMNPRSTWXY , at the beginning of the word there can also be vowels from the set AEIOU .



Metaphone code calculation algorithm
  1. Delete all duplicate adjacent letters except the letter C.
  2. Start the words convert according to the following rules:

    KN β†’ N

    GN β†’ N

    PN β†’ N

    AE β†’ E

    WR β†’ R
  3. Remove the letter B at the end if it goes after M.
  4. Replace C with the following rules.

    On X: CIA β†’ XIA, SCH β†’ SKH, CH β†’ XH

    On S: CI β†’ SI, CE β†’ SE, CY β†’ SY

    On K: C β†’ K
  5. Replace D with the following rules.

    On J: DGE β†’ JGE, DGY β†’ JGY, DGI β†’ JGY

    On T: D β†’ T
  6. Replace GH β†’ H if this letter combination is not at the end and not before a vowel.
  7. Replace GN β†’ N and GNED β†’ NED if these letter combinations are at the end.
  8. Replace G with the following rules.

    On J: GI β†’ JI, GE β†’ JE, GY β†’ JY

    on K: G β†’ K
  9. Remove all Hs after vowels, but not before vowels.
  10. Perform subsequent transformations by the rules:

    CK β†’ K

    PH β†’ F

    Q β†’ K

    V β†’ F

    Z β†’ S
  11. Replace S with X:

    SH β†’ XH

    SIO β†’ XIO

    SIA β†’ XIA
  12. Replace T with the following rules.

    On X: TIA β†’ XIA, TIO β†’ XIO

    On 0: TH β†’ 0

    Delete: TCH β†’ CH

  13. At the beginning of the word convert WH β†’ W. If after W there is no vowel, then delete W.
  14. If X is at the beginning of a word, then convert X β†’ S, otherwise X β†’ KS
  15. Delete all Y that are not before vowels.
  16. Remove all vowels except the initial one.


Examples


AKXN β†’ Agashin, Akachenok, Akishin, Aksionenko, Aksionov, Akchunaev, Akshanov, Akshentsev, Akshinsky, Akshintsev, Akshonov.

FSLX β†’ Vasilishin, Vasilchak, Vasilchenko, Vasilchik, Vasilchikov, Vasilchenko, Vasilchuk, Vasiljushchenko.

SRFM β†’ Serafimov, Serafimsky, Serafimchuk, Cereyfman.



The same Metaphone code value has an average of 6 last names.



Double Metaphone



Double Metaphone (2000) is somewhat different from other phonetic algorithms, generating not one but two codes (both up to 4 characters long) from the source word - one reflects the basic pronunciation of the word, the other the alternative version. It has a large number of different rules that take into account, among other things, the different origins of words, paying attention to Eastern European, Italian, Chinese words, and so on. The transformation rules are quite numerous, I will not publish them, and those who wish can read about them in an article in the journal Dr Dobbs .



Examples


JXRF β†’ Gisharov .

KKRF β†’ Gagarov, Kagarov, Kacharovsky, Kacherovsky , Kachurivsky, Kachurov, Kachurovsky, Kicherov, Kokarev, Kokourov, Kokourov, Kocharov, Kochurov, Kukarev, Tsakirov, Tsokurov, Tsugrov.

KXRF β†’ Gisharov , Gocharov, Kacherov, Kacherovsky , Kasharevsky, Kocharov, Kocherev, Kocheryaev, Kochuraev, Kosharev, Kosherov.

PNFS β†’ Banovsky, Bakhnovsky, Binevsky, Binyavsky, Buynovsky, Buyanovsky, Panevsky, Panovsky, Panovsky, Penevsky, Pinevsky, Piunovsky, Pikhnovsky.



Double Metaphone matches an average of 8-9 last names with the same code.



Russian Metaphone



In 2002, an article by Pyotr Kankowski was published in the 8th edition of the Programmer magazine, describing his adaptation of the English version of the Metaphone algorithm to the severe Siberian frosts, bears and balalaikas. This algorithm transforms the original words in accordance with the rules and norms of the Russian language, taking into account the phonetic sound of unstressed vowels and possible β€œmergers” of consonants in pronunciation. It shows very good results in practice, despite the fact that it is based on fairly simple rules. All letters are divided into groups by sound - vowels and consonants ( vowels and consonants, respectively, in English terminology), deaf and voiced. Voiced consonants are converted into their corresponding deaf pair, the β€œmerging” when pronouncing a sequence of letters is combined, and some other manipulations are performed. Below I will give a slightly modified version, which, unlike the original by Peter Kankowski, introduces rules related to the phonetic equivalence of C and TS or DC , and does not compress the endings - saving bytes is not our task.



Algorithm for calculating the Russian Metaphone code
  1. For all vowels, do the following.

    YO, IO, IT, IE β†’ AND

    Oh, I, I β†’ A

    E, E, E β†’ I

    Yu β†’ U
  2. For all consonants, followed by any consonant, except L , M , H, or P , or for consonants at the end of a word, hold a stun:

    B β†’ P

    H β†’ C

    D β†’ T

    In β†’ F

    G β†’ K
  3. We stick together the TS and the DS in C :

    TS β†’ TS
As a result, the algorithm copes with its task very well - the resulting set contains really phonetically similar words. And while there are quite a few extra words, mainly due to the fact that vowels are not ignored, but are converted and used in the final code. However, there are some words which, in spite of their phonetic similarity, do not fall into the result set due to too β€œstrict” rules of the algorithm.



In the case of Adolf Schwardsegger, the result of the work of the algorithm of the Russian Metaphone will be:



image



Thus, the algorithm in this case reflects the real phonetic similarity of these two last names.



Examples


VITAFSKIY β†’ Vitavsky, Vitovsky.

VITINBIRK β†’ Vitenberg, Wittenberg.

NASANAF β†’ Nasanov, Nasonov, Nassonov, Nosonov.

PIRMAKAF β†’ Permakov, Permyakov, Permyakov.



This algorithm converts to the same code an average of 1-2 surnames.



Caverphone



The Caverphone algorithm was developed in 2002 as part of one of the New Zealand projects for comparing data in old and new electoral lists; therefore, it is most focused on local pronunciation, although it gives quite acceptable results for Russian surnames.



Caverphone code calculation algorithm
  1. Convert the name or surname to lower case (the algorithm is case sensitive).
  2. Remove the letters e at the end.
  3. Transform the beginning of the word according to the following table (relevant for local New Zealand names and surnames). The figure 2 means the time stamp for the consonant letter, which will subsequently be deleted.
    coughroughtoughenoughgnmb
    cou2frou2ftou2fenou2f2nm2
  4. Perform character replacements according to the following table:
    cqcicecytchcqxvdgtiotiadphbshz
    2qsisesy2chkkkf2gsiosiatfhps2s
  5. Replace all vowels at the beginning of the word with A , in other cases - with 3 . Thus, the number 3 serves as a timestamp for a vowel that will be used in subsequent transformations, and then will be deleted. After, it is necessary to make replacements in the following tables (notation: s + - several consecutive characters, ^ h - the character at the beginning of the line, w $ - the character at the end of the line):
    j^ y3^ yy3gh3ghgs +t +p +k +f +m +
    yY3A33kh322kSTPKFM
    n +w3wh3w $w^ hhr3r $rl3l $l
    NW3Wh332A2R332L332
  6. Delete all digits 2 . If the number 3 remains at the end of the word, then replace it with A. Then delete all numbers 3 .
  7. Crop the word to 10 characters, or add up to 10 characters in units.


Examples


KPRLN11111 β†’ Gabrelyan, Gabrielyan, Gabrielyan, Kaparulin, Kapralin, Kaprelyan.

MSRFK11111 β†’ Meizerovich, Misarovich, Misyurevich.

PLLF111111 β†’ Balalaev, Balaliev, Balaluyev, Bilaliev, Bilalov, Bilyalov, Bolelov, Palilov, Polilov, Polulyakhov.



Caverphone matches the same code with about 4-5 names.



Total



Most of these algorithms are implemented in a variety of languages, including C, C ++, Java, C #, and PHP. Some of them, such as Soundex and Metaphone, are integrated or implemented as plug-ins for many popular DBMSs, and are also used as part of full-fledged search engines, for example, Apache Lucene . The area of ​​their application is quite specific, because a significant increase in convenience for users can be achieved only when searching for names, but nevertheless, their competent use is a plus for search engines.



Links

  1. Java code for the article. Yandex.Disk
  2. Implements Soundex, Refined Soundex, Metaphone, Double Metaphone, Caverphone in Java.

    Apache commons codec
  3. Implementing NYSIIS in Java. Project egothor
  4. The implementation of Daitch-Mokotoff Soundex in Java. http://joshualevy.tripod.com/genealogy/dmsoundex/dmsoundex.zip
  5. Description Soundex. http://en.wikipedia.org/wiki/Soundex
  6. Description Daitch-Mokotoff Soundex. http://www.jewishgen.org/infofiles/soundex.html
  7. Description of NYSIIS.

    http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
  8. Description Metaphone. http://en.wikipedia.org/wiki/Metaphone
  9. Description Double Metaphone. http://www.drdobbs.com/article/184401251
  10. Description of Russian Metaphone. http://forum.aeroion.ru/topic461.html
  11. Description Caverphone. http://en.wikipedia.org/wiki/Caverphone
  12. Online demo Soundex. http://www.gedpage.com/soundex.html
  13. NYSIIS online demo. http://www.dropby.com/NYSIIS.html
  14. Online demo Daitch-Mokotoff Soundex. http://stevemorse.org/census/soundex.html
  15. Online demo Metaphone. http://www.searchforancestors.com/utility/metaphone.php

Source: https://habr.com/ru/post/114947/



All Articles