📜 ⬆️ ⬇️

Search for various variants of transliteration of Russian texts (full name) in Latin for samples in MySQL

Actually, I ran into a banal task - there is a base in which there is a mass of full name data written in roman letters. Often quite illiterate and unpredictable transliterated. It seems to be the GOST 7.79 (ISO 9), but in practice it is not always respected.
The task is a database with records (as a rule, a full name) in Latin letters. Bad Latin. It is required for a Russian search string in a web application to select from a database (MySQL) records that seem to be transliterated into Latin.


The first thought is to write a function that would return an array containing the possible spelling of the original phrase in Latin.
Source 1, allowing you to look at how a banal string can be transliterated
print_r (r2e ("     ")); function r2e ($txt) { $r=array(mb_strtolower($txt,"windows-1251")); $r=r2es($r,"//i",array("x","ks","cs")); $r=r2es($r,"//i",array("ey","ei")); $r=r2es($r,"//i",array("ov","off")); $r=r2es($r,"//i",array("a")); $r=r2es($r,"//i",array("b")); $r=r2es($r,"//i",array("v","w")); $r=r2es($r,"//i",array("g")); $r=r2es($r,"//i",array("d")); $r=r2es($r,"//i",array("e")); $r=r2es($r,"//i",array("yo","jo")); $r=r2es($r,"//i",array("zh","z")); $r=r2es($r,"//i",array("z","s")); $r=r2es($r,"//i",array("i")); $r=r2es($r,"//i",array("j","y")); $r=r2es($r,"//i",array("k","c")); $r=r2es($r,"//i",array("l")); $r=r2es($r,"//i",array("m")); $r=r2es($r,"//i",array("n")); $r=r2es($r,"//i",array("o")); $r=r2es($r,"//i",array("p")); $r=r2es($r,"//i",array("r")); $r=r2es($r,"//i",array("s")); $r=r2es($r,"//i",array("t")); $r=r2es($r,"//i",array("u")); $r=r2es($r,"//i",array("f")); $r=r2es($r,"//i",array("h","kh")); $r=r2es($r,"//i",array("c","ts")); $r=r2es($r,"//i",array("ch")); $r=r2es($r,"//i",array("sh")); $r=r2es($r,"//i",array("shch","sch","sh")); $r=r2es($r,"//i",array("")); $r=r2es($r,"//i",array("y")); $r=r2es($r,"//i",array("")); $r=r2es($r,"//i",array("e")); $r=r2es($r,"//i",array("u","yu","ju")); $r=r2es($r,"//i",array("ya","ja")); return $r; } function r2es ($var, $pattern, $splits) { $sp=array(); $nsp=array(); foreach ($var as $v) if (preg_match($pattern,$v)) foreach ($splits as $split) $sp=array_merge($sp,array(preg_replace($pattern,$split,$v))); else $nsp=array_merge($nsp,array($v)); return array_merge($sp,$nsp); } 



The contents of the output array from example 1
Array
(
[0] => ochen prostaya fraza na russkom yazyke
[1] => ochen prostaja fraza na russkom jazyke
[2] => ochen prostaya fraza na russcom yazyce
[3] => ochen prostaja fraza na russcom jazyce
[4] => ochen prostaya frasa na russkom yasyke
[5] => ochen prostaja frasa na russkom jasyke
[6] => ochen prostaya frasa na russcom yasyce
[7] => ochen prostaja frasa na russcom jasyce
)

')
it seemed to work well, even informatively, but the resulting array should also be glued together in the appendix to the request as
 $sqlstr = "SELECT <smth> FROM <smwhr> WHERE searchField LIKE '%" . implode("%' OR searchField LIKE '%",$result) . "%'"; ,     $sqlstr = "SELECT <smth> FROM <smwhr> WHERE searchField LIKE '" . implode("' OR searchField LIKE '",$result) . "'"; 


Well and technically alarming that the SQL query can be quite easily lengthened at times.
But with minimal modifications it will work with almost any database engine.

The solution of the problem turned out to be much smaller - instead of “SELECT ... WHERE field LIKE 'pattern'”, MySQL supports the advanced constructions “SELECT ... WHERE field REGEXP 'regexp'”, allowing you to search by regexp.

Total - compact code with significantly better performance when working with MySQL
So much better
 function r2e ($txt) { $txt=mb_strtolower($txt,"windows-1251"); $sr=array("","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""); $se=array("(x|[kc]s)","e[yi]","o(v|ff)","a","b","[vw]","g","d","e","[yj]o","z(h)*","[zs]","i","[jy]","[kc]","l","m","n","o","p","r","s","t","u","f","(k|c)*h","(c|ts)","ch","sh","s(h)*(c)*h","","y","","e","[yj]*u","[yj]a"); return str_replace($sr,$se,$txt); } 


the result is a string of the form
 ochen prosta[yj]a fra[zs]a na russ[kc]om [yj]a[zs]y[kc]e 

quite suitable for insertion into the "SELECT columns FROM table WHERE searchField REGEXP '$ txt'" as, in fact, $ txt.

And what about other popular DBMS?
- Unfortunately, with MSSQL everything turned out to be not so rosy - although you can use the CLR module (for example, from here ) and get support for regular expressions.
- With T-SQL (Oracle) everything is fine, at least from the 10G version there is a REGEXP_LIKE (fieldName, regexpString).
- in PostgreSQL there is a long time ago regexp_matches (fieldName, regexpString).
- in FireBird you can use SIMILAR TO, and you will have to rework the syntax, for example, for the nicely Russian "Y" instead of s (h) * (s) * h you will need to describe it as s (h) {0,1} {0,1} h and hereinafter by analogy.

Source: https://habr.com/ru/post/202528/


All Articles