
A simple example of phonetic search

Formulation of the problem


There is a database containing a list of Russian and Ukrainian first names and surnames in English transcription, as recorded in foreign-travel passports. Some time ago the transcription rules for these passports in Russia changed (either from the English style to the French one, or vice versa), so there is a very real and even official possibility that any given name is spelled in more than one way. In addition, the data is sometimes taken from a seaman's passport, which complicates matters further.
Now imagine that you need to quickly find a person in this database by name, for example Shcheglov... (smile)

Solution options


I did not like the existing algorithms: they are either oriented toward pure English or make "hot search" completely impossible (the surname has to be entered in full and only then compared). Then I remembered a fairly simple algorithm that I had written many years ago for a Greek project, where a similar problem existed in an even harsher form: operators had to catch (Greek) names by ear, over the phone. The description of the algorithm was given to me by my then partner, who called it "voel". Greek and Russian are, of course, not much alike, but the mess with transcription is quite similar, and I decided to risk reworking that "voel" for Russian needs.


Some necessary explanations


Many years ago, when the sun was brighter, the grass was greener, girls were more mysterious, and thick books were written about the word RAD, the author of these lines and a couple of like-minded people earned their bread and butter, as best they could, by writing backends for small and medium-sized Greek companies.
Since then much has changed; in particular, for many years the author's work has had nothing to do with IT in general or with programming in particular. Dull human resource management, the gray days of an office rat.
They say, however, that there is no such thing as a former programmer, and when an office relocation raised the question of switching to paperless office work, I ventured to start writing a system myself, since the requirements are rather modest.
I have ventured to share what I consider some interesting and/or controversial parts of the project, in the modest hope of useful criticism and valuable comments.
So, "Shcheglov".

Proposed Solution


In general, the "voel" in question did a very simple thing: it converted the whole word to a single case, for example lower case, and replaced certain letters or letter combinations with their "phonetic correspondences", that is, simply with other letters (much less often, with combinations).
I tried to build a similar table of "phonetic correspondences" for the Russian language, and ended up with something like the following (comments are welcome):

First, all kinds of double consonants. We collapse them; one is enough:
bb = b      kk = k      rr = r
cc = c      ll = l      ss = s
dd = d      mm = m      tt = t
ff = v      nn = n      zz = z
hh = h      pp = p


Next, the various sibilants (hissing and whistling sounds):
sh = s      zch = s     ck = k
ch = c      sch = s     ks = x
shch = s    csh = s     ts = c
zhch = s    zh = z      tc = c


Then the rest of the signature features of the Russian language, such as "ю", "я", "ё", "й", "ф" and many others:
yu = u      je = e      oy = oi
ju = u      ei = ei     oj = oi
u = u       ey = ei     ph = f
ya = a      ej = ei     yy = i
ja = a      yo = e      ii = i
ia = a      io = e      iy = i
ye = e      jo = e      y = i
ie = e      oi = oi     yy = i


And finally, the rest:
kh = h      gh = g      ' = (empty)


That is, whether the aforementioned Shcheglov is written as "Shcheglov", "Scheglov" or even "Zchegloff", this table turns it into one and the same "seglov".
It remains to write the code.
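
Before the class itself, the effect of the table can be checked with a small console sketch. It is not the author's implementation: it uses a plain greedy longest-match scan over a handful of table rows, and the names PhoneticKey, Keys, Vals and MaxKeyLen are hypothetical, introduced only for this illustration. The result is left in lower case to match the "seglov" example.

```delphi
program VoelSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Only a small subset of the correspondence table, enough for the example.
  Keys: array[0..5] of string = ('shch', 'sch', 'zch', 'sh', 'ch', 'ff');
  Vals: array[0..5] of string = ('s', 's', 's', 's', 'c', 'v');
  MaxKeyLen = 4; // length of the longest key above

// Hypothetical helper: lower-cases the word, then replaces table keys,
// always preferring the longest key that matches at the current position.
function PhoneticKey(const Value: string): string;
var
  Src: string;
  i, n, k: Integer;
  Matched: Boolean;
begin
  Result := '';
  Src := LowerCase(Value);
  i := 1;
  while i <= Length(Src) do
  begin
    Matched := False;
    // Longer pieces first, so that 'shch' wins over 'sh'.
    for n := MaxKeyLen downto 1 do
    begin
      for k := Low(Keys) to High(Keys) do
        if (Length(Keys[k]) = n) and (Copy(Src, i, n) = Keys[k]) then
        begin
          Result := Result + Vals[k];
          Inc(i, n);
          Matched := True;
          Break;
        end;
      if Matched then
        Break;
    end;
    // No key starts here: keep the letter as it is.
    if not Matched then
    begin
      Result := Result + Src[i];
      Inc(i);
    end;
  end;
end;

begin
  Writeln(PhoneticKey('Shcheglov'));
  Writeln(PhoneticKey('Scheglov'));
  Writeln(PhoneticKey('Zchegloff'));
end.
```

With these six rows all three spellings should print the same key, which is the property the table is built for. A real table would of course carry every row listed above.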

TVoel class


The example below uses Delphi.
The correspondence table is read from a file into a list of type TStringList and sorted so that binary search can be used on it. The Locate function performs this search. The implementations of the remaining functions are omitted as trivial.

uses
  Classes, SysUtils;  // TStrings; CompareText, AnsiUpperCase

type
  TVoel = class
  private
    FFileName: String;
    FList: TStrings;  // correspondence table, read as Names / Values
    procedure setFileName(const Value: String);
    procedure readFile;
    function isReady: Boolean;
    function Locate(const Value: String; var Index: Integer): Boolean;
  public
    constructor Create;
    destructor Destroy; override;
    function Convert(const Value: String): String;
    property FileName: String read FFileName write SetFileName;
  end;

implementation

function TVoel.Convert(const Value: String): String;
var
  ii, p, len: Integer;
  str: String;
  Ch: String;
  found: Boolean;
begin
  Result := '';
  len := Length(Value);
  ii := 0;
  while ii < len do
  begin
    Inc(ii);
    Ch := Value[ii];
    found := Locate(Ch, p);
    if found then
    begin
      // Extend the candidate one character at a time while Locate succeeds.
      str := Value[ii];
      while found and (ii < len) do
      begin
        Inc(ii);
        str := str + Value[ii];
        found := Locate(str, p);
      end;
      // The last extension failed: step back one character.
      if not found then
      begin
        SetLength(str, Length(str) - 1);
        Dec(ii);
      end;
      // Substitute only if the candidate is exactly the located key.
      if CompareText(str, FList.Names[p]) = 0 then
        Ch := FList.ValueFromIndex[p];
    end;
    Result := Result + Ch;
  end;
  Result := AnsiUpperCase(Result);
end;
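
Locate is omitted from the listing, so the following is only a guess at how it might look. Because Convert first calls it with a single character, the sketch assumes that Locate succeeds when the argument is a prefix of some key, and that Index is changed only on success. The "key=value" line format is likewise an assumption, suggested by the use of Names and ValueFromIndex. Here Locate is written as a free-standing function taking the list as a parameter, and CompareByName is a hypothetical helper.

```delphi
program LocateSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes, SysUtils;

// Hypothetical comparer: order entries by key only, ignoring case.
function CompareByName(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareText(List.Names[Index1], List.Names[Index2]);
end;

// Assumed semantics: binary search for the first entry whose key
// starts with Value. Index is changed only when such an entry exists.
function Locate(List: TStringList; const Value: string;
  var Index: Integer): Boolean;
var
  Lo, Hi, Mid: Integer;
begin
  Lo := 0;
  Hi := List.Count; // search the half-open range [Lo, Hi)
  while Lo < Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if CompareText(List.Names[Mid], Value) < 0 then
      Lo := Mid + 1
    else
      Hi := Mid;
  end;
  // Lo is the first key >= Value; it matches if Value is its prefix.
  Result := (Lo < List.Count) and
    SameText(Copy(List.Names[Lo], 1, Length(Value)), Value);
  if Result then
    Index := Lo;
end;

var
  Table: TStringList;
  P: Integer;
begin
  Table := TStringList.Create;
  try
    // A few rows of the table in the assumed "key=value" format.
    Table.Add('shch=s');
    Table.Add('sh=s');
    Table.Add('ch=c');
    Table.Add('kh=h');
    Table.CustomSort(CompareByName);

    if Locate(Table, 'shc', P) then
      Writeln('shc is a prefix of ', Table.Names[P]);
    if Locate(Table, 'SH', P) then
      Writeln(Table.Names[P], ' -> ', Table.ValueFromIndex[P]);
    if not Locate(Table, 'sha', P) then
      Writeln('no key starts with sha');
  finally
    Table.Free;
  end;
end.
```

Sorting by key with the same case-insensitive comparison that the search uses keeps all keys sharing a prefix next to each other, which is what makes the single binary search sufficient.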


At the moment this class is used to "debug" the phonetic table on a modest list of a few tens of thousands of people. The final implementation will, of course, have to be written as a stored procedure inside the database.

Source: https://habr.com/ru/post/120182/

