
Automatic language detection for free text in PHP: the PHPLangautodetect library

Working at a startup that is trying to create something new and unique on the market is exciting, not only because of the opportunities it opens up, but also because of the non-trivial problems the creators face, often ones that nobody has solved before. One such problem came up for me yesterday. We are given an arbitrary string of text, and we know that it may be bilingual or, in some cases, trilingual, that is, it may mix several languages. We need to determine for the user which language the text is written in.

In fact, the task is not that rare. Similar functionality exists in text editors and in the PuntoSwitcher keyboard layout switcher, it is in demand in machine translation systems, and information retrieval systems rely on it as well. In my case the problem arose while building a specialized search engine and text classifier. I needed this capability inside my own PHP program, without relying on third-party services. A similar feature is available as a web service in the Google Language API (I have already examined this service in my blog), but it runs remotely and has restrictions that are significant for us; in particular, language identification is performed with a noticeable delay and is asynchronous. I also wanted full control over the process and the ability to configure it flexibly, which third-party services, alas, do not offer. So I had to think it through and implement it myself, and here is the result.


First, a little theory. It should be said right away that automatic language detection is inexact and fundamentally probabilistic. The result always comes with some probability, especially for languages that are different but share a very similar or even identical alphabet. We also depend on the length of the text under study: the less material we have, the harder, or even impossible, detection becomes. Statistics need enough data to compute their parameters, and a short string may not provide enough for identification, especially for languages that share mostly the same alphabet. In such a text the unique letters may simply not occur, and the text will be attributed to another language. So the first limitation of the method that analyzes the alphabet used is text length: the longer the text, the more accurate the analysis. Here is an example: the word "rappel". What language is it in? English? There it means "to descend on a rope". But the same word exists in German, where it means "a (sudden) fit of madness, a fit of rage".
The alphabet-analysis method has two varieties. The "percentage of the alphabet used" variant counts how many distinct letters of an alphabet occur in the text and expresses that as a percentage of the alphabet's size. The second variant counts the characters of the text that belong to the alphabet; some characters can belong to several alphabets and are then counted for each of those languages.
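The two variants can be sketched roughly as follows. This is a minimal illustration, not the library's code: the function names `split_chars`, `alphabet_coverage` and `text_share` are made up for the example, and the second variant is expressed here as a share of the text rather than a raw count.

```php
<?php
// Illustrative sketch only: function names are assumptions, not library API.

// Split a UTF-8 string into an array of single characters.
function split_chars($text)
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Variant 1: percentage of the alphabet's letters that occur in the text at least once.
function alphabet_coverage($text, array $alphabet)
{
    $seen = array_unique(split_chars(mb_strtolower($text, 'UTF-8')));
    $used = array_intersect($alphabet, $seen);
    return 100.0 * count($used) / count($alphabet);
}

// Variant 2: percentage of the text's characters that belong to the alphabet.
function text_share($text, array $alphabet)
{
    $chars = split_chars(mb_strtolower($text, 'UTF-8'));
    if (count($chars) === 0) {
        return 0.0;
    }
    $hits = 0;
    foreach ($chars as $ch) {
        if (in_array($ch, $alphabet, true)) {
            $hits++;
        }
    }
    return 100.0 * $hits / count($chars);
}

$english = str_split('abcdefghijklmnopqrstuvwxyz');
$sample  = 'rappel';

printf("coverage: %.1f%%\n", alphabet_coverage($sample, $english));
printf("share:    %.1f%%\n", text_share($sample, $english));
```

For a short word like "rappel" the two numbers diverge sharply: every character belongs to the alphabet, yet only a handful of its letters are used, which is why the choice of variant and threshold matters.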

The second method relies on pre-built rules that identify the language of a text by letter sequences that are unique to it or grammatically characteristic (for example, the articles in English, certain letters in Russian, or "є" in Ukrainian). Such n-gram rules can be developed by linguists and make it possible to determine the language faster and more accurately, but they do not guarantee a result either. First, they have to be created, which requires knowing the language well enough, and there are not that many unique characteristic sequences across languages. However, if you know in advance which languages you need to tell apart, there may be more combinations unique among them than among all languages. If you only have Russian and English, there are obviously more such letter combinations than for the German-English pair.
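A rule check of this kind might look like the sketch below. The rule table and the function name `detect_by_rules` are hypothetical and chosen only for illustration; the sequences are examples, not the library's actual rule set.

```php
<?php
// Hypothetical rule table: example sequences only.
$rules = array(
    'en' => array('the ', ' of '),
    'ua' => array('ї', 'є'),
);

// Return the first language whose rules match, or false if none do.
// With $match_all = true every rule of a language must be present in the text.
function detect_by_rules($text, array $rules, $match_all = false)
{
    $text = mb_strtolower($text, 'UTF-8');
    foreach ($rules as $code => $patterns) {
        $matched = 0;
        foreach ($patterns as $pattern) {
            if (mb_strpos($text, $pattern, 0, 'UTF-8') !== false) {
                $matched++;
            }
        }
        $ok = $match_all ? ($matched === count($patterns)) : ($matched > 0);
        if ($ok) {
            return $code;
        }
    }
    return false;
}

var_dump(detect_by_rules('The name of the rose', $rules));
```

Note that the order of the table decides ties: the first language with a matching rule wins, so overlapping rules should be avoided.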

The case where words from different alphabets are mixed in one text deserves a separate mention. For example, personal names or the names of companies and products may be written in the original language, most often English, while the sentence as a whole is in Russian. The only thing that helps here is counting the total number of characters belonging to each alphabet and deciding in favor of the alphabet with the most characters. A future version of the library will be able to recognize such cases and return an array of the languages found in the string; for now it reports only the main language, the one with the most characters in the string.
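The majority count can be illustrated with a small sketch. Instead of alphabet tables it uses PCRE Unicode script properties, which is a substitution made for brevity: it separates Latin from Cyrillic text but cannot tell Russian from Ukrainian. The function names are hypothetical.

```php
<?php
// Count letters per script using PCRE Unicode script properties.
function count_scripts($text)
{
    return array(
        'latin'    => preg_match_all('/\p{Latin}/u', $text),
        'cyrillic' => preg_match_all('/\p{Cyrillic}/u', $text),
    );
}

// Return the script with the most letters, or false if the text has none.
function dominant_script($text)
{
    $counts = count_scripts($text);
    arsort($counts, SORT_NUMERIC);
    reset($counts);
    $top = key($counts);
    return $counts[$top] > 0 ? $top : false;
}

$mixed = 'Компания Google выпустила новый браузер Chrome';
print_r(count_scripts($mixed));
var_dump(dominant_script($mixed));
```

Keeping the full `count_scripts` array around, rather than only the winner, is what a multi-language result would be built from.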

In my library I decided to implement both approaches and all their variants, so that the algorithm can be flexibly configured to use either approach or a combination of the two. Technically the library is very simple: one class, a few methods, and properties that set the operating parameters. The source code is commented line by line, so I will not describe the whole implementation here and will dwell only on its particular features.

The library works with UTF-8 text and therefore requires the mbstring extension. It first brings the input string to a standard form, trying to recode it, then removes extra characters and checks the length. The minimum amount of text is 50 characters and the maximum is 1680, roughly one standard A4 page.
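A preparation step along these lines could be sketched as follows. The function name `prepare_text` is hypothetical, the length limits are the ones quoted above, and, unlike the library, this sketch simply rejects input that is not valid UTF-8 instead of trying to recode it.

```php
<?php
// Hypothetical helper: normalise a string before detection.
// Returns the cleaned string, or false if it is unusable.
function prepare_text($text, $min_len = 50, $max_len = 1680)
{
    // Assumption of this sketch: only valid UTF-8 is accepted.
    if (!is_string($text) || !mb_check_encoding($text, 'UTF-8')) {
        return false;
    }
    $text = mb_strtolower(trim($text), 'UTF-8');
    // Replace everything that is neither a letter nor whitespace with a space,
    // then collapse runs of whitespace into a single space.
    $text = preg_replace('/[^\p{L}\s]+/u', ' ', $text);
    $text = trim(preg_replace('/\s+/u', ' ', $text));
    if (mb_strlen($text, 'UTF-8') < $min_len) {
        return false; // too little material for a reliable decision
    }
    // Truncate long input to keep the later counting loops short.
    return mb_substr($text, 0, $max_len, 'UTF-8');
}

var_dump(prepare_text('short'));
```

Measuring the length after cleaning means digits and punctuation do not count toward the minimum, which is one possible reading of "removes extra characters and checks the length".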

You can specify various detection options. The library can analyze alphabets by looking either at the total amount of text or at the percentage of letters of each alphabet that are used. The decision threshold is adjustable and defaults to 75% (depending on the calculation, this means either that 75% of the alphabet's letters are present or that characters of this language make up at least 75% of the text). Heuristic rules can also be used to refine the result, and their priority can be adjusted: if a rule does not confirm the statistical result, you choose whether to trust the rules or the statistics. For faster operation, especially with large amounts of text or many languages, you can use rules only, since there are usually far fewer rules than alphabet characters. The way rules are applied is configurable too: a match with any one rule can be enough, or you can require a match with all rules of a language. The latter is only applicable to long texts, and there will always be some probability of error.
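The threshold decision and the rules-versus-statistics priority can be sketched like this. Both function names are hypothetical, and falling back to the other answer when the preferred one is missing is an assumption of this sketch rather than documented behavior.

```php
<?php
// Pick the best-scoring language, or false if it does not reach the threshold.
// $scores maps language codes to percentages, e.g. array('en' => 80, 'ru' => 10).
function pick_language(array $scores, $threshold = 75)
{
    if (count($scores) === 0) {
        return false;
    }
    arsort($scores, SORT_NUMERIC); // highest score first
    reset($scores);
    $best = key($scores);
    return $scores[$best] >= $threshold ? $best : false;
}

// Combine the statistical and the rule-based answer according to priority.
// Assumption: if the preferred answer is false, the other one is used.
function combine_results($by_stats, $by_rules, $rules_first = true)
{
    $preferred = $rules_first ? $by_rules : $by_stats;
    $fallback  = $rules_first ? $by_stats : $by_rules;
    return $preferred !== false ? $preferred : $fallback;
}

var_dump(pick_language(array('en' => 80, 'ru' => 10, 'ua' => 5)));
```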

After detection the library returns either a result or false, which means that the language could not be determined or is not in the database. On success we get an array with the two-letter language code (for example "en", "ru", or "ua") plus additional information: the full name of the language and, as a bonus, a link to the Wikipedia.org article about the language (in that same language, of course).
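Handling such a result might look like the sketch below. The helper `describe_result` is hypothetical, and the sample value is built by hand in the shape the listing returns (the code, followed by an array holding the name and the URL) rather than produced by the library.

```php
<?php
// Hypothetical helper: turn a detection result into a printable line.
function describe_result($result)
{
    if ($result === false) {
        return 'language not detected';
    }
    list($code, $info) = $result; // e.g. 'en' and array(name, url)
    list($name, $url)  = $info;
    return sprintf('%s (%s), see %s', $name, $code, $url);
}

// Hand-built value in the expected shape, used here instead of a real call.
$sample = array('en', array('English', 'http://en.wikipedia.org/wiki/English_language'));

echo describe_result($sample), "\n";
echo describe_result(false), "\n";
```

Checking for false with a strict comparison matters here, because a successful result is always a non-empty array.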

So far, the first version of the library works with only three languages, English, Russian, and Ukrainian, although nothing prevents you from adding alphabets and rules for any other language.

Finally, a note about speed. The fastest option is to use only the rules, since there are always fewer of them than letters and the library runs shorter loops. In particular, the longer the text and the more languages are defined in the database, the bigger the speed advantage of the rules-only option. For optimization, it is therefore best to restrict the set of languages in advance to the most probable ones and remove those you do not need; this eliminates a significant number of loop iterations, and the algorithm runs faster. You can also remove the encoding check and conversion if you are sure that only UTF-8 strings will reach the algorithm's input.
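Restricting the language set before detection can be as simple as filtering the lookup tables by key. In this sketch the function name `limit_languages` is hypothetical and the alphabets are shortened to three letters each purely to keep the example small.

```php
<?php
// Keep only the tables for the languages we actually expect.
function limit_languages(array $tables, array $wanted)
{
    return array_intersect_key($tables, array_flip($wanted));
}

// Shortened placeholder alphabets, for illustration only.
$alphabets = array(
    'en' => array('a', 'b', 'c'),
    'ru' => array('а', 'б', 'в'),
    'ua' => array('а', 'б', 'ґ'),
);

$active = limit_languages($alphabets, array('en', 'ua'));
print_r(array_keys($active));
```

Every language removed this way saves one full pass over its alphabet (or its rules) for each string checked.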

Official project site: http://code.google.com/p/phplangautodetect/
License: GNU General Public License v3
Author: Alexander Lozovyuk (aleks_raiden, aleks.raiden@gmail.com)
Language / platform: PHP 5 (requires the mbstring extension)
The distribution includes a simple test script; the latest version is available online here.

Below is the source code, with comments on the implementation of the algorithms described above.

class Lang_Auto_Detect
{
    // main variables
    // list of supported languages: code => array(name, Wikipedia article URL)
    public $lang = array(
        'en' => array('English', 'http://en.wikipedia.org/wiki/English_language'),
        'ru' => array('Russian', 'http://ru.wikipedia.org/wiki/%D0%A0%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9_%D1%8F%D0%B7%D1%8B%D0%BA'),
        'ua' => array('Ukrainian', 'http://uk.wikipedia.org/wiki/%D0%A3%D0%BA%D1%80%D0%B0%D1%97%D0%BD%D1%81%D1%8C%D0%BA%D0%B0_%D0%BC%D0%BE%D0%B2%D0%B0'),
    );
    // sensitivity threshold: the percentage of the language's characters required for it to be detected
    public $detect_range = 75;
    // whether to handle multilingual documents and return an array of the languages used
    public $detect_multi_lang = false; // not yet implemented
    // return all results and probabilities
    public $return_all_results = false; // better left disabled in real use
    // additionally use the system of rules and exceptions
    public $use_rules = false;
    // apply only the rules (much faster, but less reliable; the more text, the more reliable)
    public $use_rules_only = false;
    // priority of rules over statistics
    public $use_rules_priory = true; // true - rules take precedence over statistics, false - statistics over rules
    // stop at the first matching rule or require all rules to match?
    public $match_all_rules = false; // false - one match is enough, true - all rules must match
    // use the percentage of the alphabet or the total number of characters of each alphabet
    public $use_str_len_per_lang = true; // true - share of the text's characters, false - percentage of the alphabet's letters
    // minimum string length for detection
    public $min_str_len_detect = 50;
    // for acceptable performance, the maximum number of characters to compare
    public $max_str_len_detect = 1680;
    // internal table of the alphabets used for detection
    private $_langs = array(
        'en' => array('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'),
        'ru' => array('а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я'),
        'ua' => array('а', 'б', 'в', 'г', 'ґ', 'д', 'е', 'є', 'ж', 'з', 'и', 'і', 'ї', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ь', 'ю', 'я'),
    );
    // stores the rules
    // rules are characters or strings whose presence (any or all of them) identifies the language of the text
    private $_lang_rules = array(
        'en' => array('th', 'ir'),
        'ru' => array('', ''), // the Russian rule strings are blank here
        'ua' => array('ї', 'є'),
    );

    // class constructor
    public function __construct()
    {
    }

    // prepare the input string for analysis
    private function _prepare_str($tmp_str = null)
    {
        if ($tmp_str == null) return false; // nothing was passed, exit
        $tmp_str = trim($tmp_str);
        $tmp_encoding = mb_detect_encoding($tmp_str);
        if ($tmp_encoding === false) return false; // the encoding could not be detected
        if (mb_strlen($tmp_str, $tmp_encoding) > $this->max_str_len_detect)
        {
            // truncate the text for performance
            $tmp_str = mb_substr($tmp_str, 0, $this->max_str_len_detect, $tmp_encoding);
        }
        elseif (mb_strlen($tmp_str, $tmp_encoding) < $this->min_str_len_detect)
        {
            return false; // shorter than the minimum length
        }
        // convert the encoding
        $tmp_str = mb_convert_encoding($tmp_str, 'UTF-8', $tmp_encoding);
        // reduce everything to lower case
        $tmp_str = mb_strtolower($tmp_str, 'UTF-8');
        return $tmp_str;
    }

    // determine the language according to the rules
    // a matching rule identifies the language unambiguously, but the rules may not match at all
    private function _detect_from_rules($tmp_str = null)
    {
        if ($tmp_str == null) return false; // nothing was passed, exit
        if (!is_array($this->_lang_rules)) return false;
        // enumerate all rules
        foreach ($this->_lang_rules as $lang_code => $lang_rules)
        {
            // ignore empty rule strings: they cannot be searched for
            $lang_rules = array_filter($lang_rules, 'strlen');
            if (count($lang_rules) == 0) continue;
            $tmp_freq = 0;
            foreach ($lang_rules as $rule)
            {
                $tmp_term = mb_substr_count($tmp_str, $rule, 'UTF-8');
                if ($tmp_term >= 1) // the sequence occurs one or more times
                {
                    $tmp_freq++; // one more rule of this language has matched
                }
                // now check
                if ($this->match_all_rules === true)
                {
                    // all the rules must match
                    if ($tmp_freq == count($lang_rules)) return $lang_code;
                }
                else
                {
                    // one is enough
                    if ($tmp_freq > 0) return $lang_code;
                }
            }
        }
        return false;
    }

    // determine the language from the alphabet tables
    private function _detect_from_tables($tmp_str = null)
    {
        if ($tmp_str == null) return false; // nothing was passed, exit
        // the string must already have been prepared for comparison
        $str_len = mb_strlen($tmp_str, 'UTF-8');
        // go through all the languages and determine the probability for each
        $lang_res = array();
        foreach ($this->lang as $lang_code => $lang_name)
        {
            $lang_res[$lang_code] = 0; // default is 0, that is, not this language
            $tmp_freq = 0; // number of distinct letters of this alphabet found in the string
            $full_lang_symbols = 0; // total number of characters of this language in the string
            // the string can be arbitrarily long while the alphabet is fixed, so loop over the alphabet
            $cur_lang = $this->_langs[$lang_code];
            foreach ($cur_lang as $l_item)
            {
                // count the occurrences of the letter in the string
                $tmp_term = mb_substr_count($tmp_str, $l_item, 'UTF-8');
                if ($tmp_term >= 1) // the letter occurs one or more times
                {
                    $tmp_freq++;
                    $full_lang_symbols += $tmp_term;
                }
            }
            if ($this->use_str_len_per_lang === true)
            {
                // percentage of the text's characters that belong to this alphabet
                $lang_res[$lang_code] = ceil(100 * $full_lang_symbols / $str_len);
            }
            else
            {
                // percentage of the alphabet's letters that occur in the text
                $lang_res[$lang_code] = ceil((100 / count($cur_lang)) * $tmp_freq);
            }
        }
        // so, now let's see what we got
        arsort($lang_res, SORT_NUMERIC); // sort so that the most probable language comes first
        if ($this->return_all_results == true)
        {
            return $lang_res; // return all results; otherwise select the best one
        }
        else
        {
            // if the threshold is reached, return the language code, otherwise false (the language cannot be determined)
            reset($lang_res);
            $key = key($lang_res);
            if ($lang_res[$key] >= $this->detect_range)
                return $key;
            else
                return false;
        }
    }

    // common entry point for determining the language
    public function lang_detect($tmp_str = null)
    {
        if ($tmp_str == null) return false; // nothing was passed, exit
        $tmp_str = $this->_prepare_str($tmp_str);
        if ($tmp_str === false) return false;
        if ($this->use_rules_only === true)
        {
            // only the rules are used, the tables are skipped
            $res = $this->_detect_from_rules($tmp_str);
        }
        else
        {
            // the combined mode cannot return the full breakdown of results, so disable it
            $this->return_all_results = false;
            $res = $this->_detect_from_tables($tmp_str);
            // proceed from the priority settings of rules and statistics
            if ($this->use_rules === true && $this->use_rules_priory === true)
            {
                // rules take precedence over statistics
                $res = $this->_detect_from_rules($tmp_str);
            }
        }
        // nothing was detected, or the language is not in the database
        if ($res === false || !isset($this->lang[$res])) return false;
        return array($res, $this->lang[$res]);
    }
}

P.S. Of course, the code does not claim to be perfect, and there may already be implementations of this functionality on the web that I did not find. If you know of any, please let me know in the comments. The original article is posted on my blog.

Source: https://habr.com/ru/post/27378/

