
You know, working with a startup that is trying to create something new and unique for the market is very exciting - and not only because of the opportunities it opens up, but also because of the non-trivial tasks and questions that its creators face, questions no one has solved before. One such question landed on my desk just yesterday: we are given an arbitrary line of text, which may be bilingual or, in some cases, trilingual - that is, mixed text in several languages - and we need to tell the user which language the text is written in.
In fact, the task is not that rare - similar functionality exists in text editors and in the PuntoSwitcher keyboard layout switcher, it is in demand in machine translation systems, not to mention information retrieval systems. It was, in fact, in the context of building a specialized search engine and text classifier that this problem came up for me. I needed this capability inside my own program on the PHP platform, without using third-party services. A similar feature is available as a web service in the Google Language API (I have already reviewed this service in my blog), but it runs remotely and has restrictions that matter to us: in particular, language identification is performed with a significant delay and is asynchronous in nature. Besides, I really wanted to have full control over the process and be able to configure it flexibly, which, alas, third-party services do not offer. So I had to think it over and implement it myself, and the result is presented here for your attention.
First, a little theory. It should be said right away that automatic language detection is inherently inexact and probabilistic: the result always comes with some degree of probability, especially for languages whose alphabets are very similar or even identical in writing. We also depend on the length of the string under study - the less material we have, the harder (or outright impossible) the detection becomes. Statistics need enough raw material to compute their parameters, and a short string simply may not provide it, especially when we are analysing languages that share essentially the same alphabet. In such a text the unique letters may simply never appear, and the string will be classified as another language. So the first limitation of the alphabet-analysis method is the length of the text: the longer it is, the more accurate the analysis. Here is an example: the word "rappel". What language is it? English? There it means to descend on a rope. But the same word also exists in German, where it means a (sudden) fit of madness, an attack of rage.
This alphabet-analysis method comes in two varieties. The first, the "percentage of the alphabet used", counts how many distinct letters of each alphabet occur in the text and what percentage of that alphabet they cover. The second counts the number of characters in the text that belong to each alphabet; note that some characters can fall into several alphabets at once and will then be counted for both languages.
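As a minimal illustration (this is only a sketch, not the library code shown below), both metrics can be computed for a single hypothetical alphabet with mb_substr_count():

// Sketch only: the two alphabet-based metrics for one hypothetical alphabet.
function alphabet_metrics($text, array $alphabet)
{
    $text = mb_strtolower($text, 'UTF-8');
    $letters_seen = 0; // distinct letters of the alphabet found in the text
    $total_chars  = 0; // characters of the text that belong to this alphabet
    foreach ($alphabet as $letter)
    {
        $count = mb_substr_count($text, $letter);
        if ($count > 0)
        {
            $letters_seen++;
            $total_chars += $count;
        }
    }
    return array(
        'alphabet_percent'  => ceil(100 / count($alphabet) * $letters_seen), // variant 1: % of the alphabet used
        'chars_in_alphabet' => $total_chars                                  // variant 2: raw character count
    );
}

// Example: compare against the English alphabet and a (truncated, for brevity) Russian one.
$en = range('a', 'z');
$ru = array('а','б','в','г','д','е','ж','з','и','к','л','м','н','о','п','р','с','т','у');
print_r(alphabet_metrics('A sample English sentence', $en));
print_r(alphabet_metrics('A sample English sentence', $ru));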
The second method relies on pre-formed rules that identify the text by letter sequences unique to, or grammatically characteristic of, a language (for example, the articles in English, letters that occur only in Russian such as "ы" or "ъ", or "є" in Ukrainian). Such rules for n-grams can be developed by linguists and allow the language to be determined faster and more accurately, but they still do not guarantee a result. First the rules have to be created, which means knowing the language well enough, and there are not that many unique, characteristic sequences across languages. On the other hand, if you know in advance which languages you need to distinguish, there may be more combinations that are unique within that set than if you try to cover all languages at once: if you only have Russian and English, there are obviously more such letter combinations than in a German-English pair.
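A rough sketch of this idea (the marker strings here are only illustrative, although "th"/"ir" and "ї"/"є" do appear in the library's rule table below):

// Sketch only: rule-based detection with a few marker sequences.
$rules = array(
    'en' => array('th', 'ir'), // frequent English letter pairs
    'ua' => array('ї', 'є')    // letters specific to Ukrainian
);

function detect_by_rules($text, array $rules)
{
    $text = mb_strtolower($text, 'UTF-8');
    foreach ($rules as $lang => $markers)
    {
        foreach ($markers as $marker)
        {
            if (mb_substr_count($text, $marker) > 0)
            {
                return $lang; // in this sketch a single match is enough
            }
        }
    }
    return false; // no rule matched
}

echo detect_by_rules('Третє видання', $rules); // prints "ua"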
A separate case worth mentioning is text in which words from different alphabets are mixed. For example, personal names or the names of companies and products may be written in the original language, most often English, while the sentence as a whole is in Russian. Here the only thing that helps is counting the total number of characters belonging to each alphabet and deciding by whichever has more. In a future version the library will be able to recognise such cases and return an array of all the languages found in the string; for now it reports only the main language, the one with the most characters in the string.
In my library I decided to use both approaches and all the possible options, making it possible to flexibly configure the algorithm to use either approach or a combination of them. Technically the library is very simple: one class with a few methods and properties that set the parameters of its work. The source code is commented line by line, so I will not describe the whole implementation here, only the notable details.
The library works with text in UTF-8 encoding and therefore requires the mb_strings module. First of all it normalises the input string, trying to re-encode it, then trims extra characters and checks the length. The minimum amount of text is 50 characters, the maximum is 1680, which is roughly one standard A4 page.
Various detection options can be specified. The library can analyse the alphabets by looking either at the total number of characters in the text or at the percentage of each alphabet's letters that are used. The decision threshold is also adjustable; by default it is 75% (depending on the calculation mode this means either that 75% of the alphabet's letters are present, or that more than 75% of the characters in the text belong to that language). Heuristic rules can additionally be used to refine the result, and their priority is configurable: if a rule does not confirm the statistical result, you decide whether to trust the rules or the statistics. For faster operation, especially on large amounts of text or with many languages, the rules can be used on their own - there are usually far fewer rules than letters in the alphabets. The way the rules are applied is also configurable: a match with a single rule of a language can be enough, or matches with all of that language's rules can be required, though the latter only makes sense for long texts and there will always be some probability of error.
After detection the library returns either false, meaning that the language could not be determined or is not in its database, or, on success, an array with the two-letter code of the language (for example "en", "ru" or "ua") plus additional information: the full name of the language and, as a bonus, a link to the article about that language on Wikipedia.org (in that language, of course).
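A hypothetical usage example based on the class shown below (the file name here is my assumption, not part of the distribution):

require_once 'lang_auto_detect.php'; // assumed file name for the class below

$detector = new Lang_Auto_Detect();
$detector->detect_range = 80;   // raise the decision threshold from the default 75
$detector->use_rules    = true; // additionally apply the heuristic rules

$text = 'This is a reasonably long sample sentence, written in plain English, so that the detector has enough material to analyse.';
$result = $detector->lang_detect($text);

if ($result === false)
{
    echo 'The language could not be determined';
}
else
{
    // $result[0] is the two-letter code, $result[1] is array(name, Wikipedia link)
    echo 'Detected: ' . $result[0] . ' (' . $result[1][0] . '), see ' . $result[1][1];
}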
So far this first version of the library works with only three languages - English, Russian and Ukrainian - although nothing prevents adding alphabets and rules for any other languages.
Finally, a note about speed. The fastest option is to use only the rules, since there are always fewer of them than letters in the alphabets, so the loops inside the library are shorter; the longer the text and the more languages there are in the database, the bigger the advantage of the rules-only mode. For optimisation it is therefore best to limit the set of languages in advance to the most probable ones and remove those you do not need - this eliminates a large number of loop iterations and the algorithm runs faster. You can also remove the encoding check and conversion if you are sure that your system will only ever pass UTF-8 strings to the algorithm.
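For example, since the table-based detection iterates over the public $lang array (see _detect_from_tables below), unneeded languages can be dropped from it before calling lang_detect(). This is only a sketch of the idea; the private $_langs and $_lang_rules tables would have to be trimmed in the class source itself to shorten the remaining loops:

$detector = new Lang_Auto_Detect();
// We know in advance that the input is only ever English or Russian,
// so drop Ukrainian from the public language list: the alphabet loop
// in _detect_from_tables() will then skip it entirely.
unset($detector->lang['ua']);

$text = 'Another sufficiently long sample sentence, again written in plain English, for the detector to analyse.';
$result = $detector->lang_detect($text);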
Official project site: http://code.google.com/p/phplangautodetect/
License: GNU General Public License v3
Author: Alexander Lozovyuk (aleks_raiden, aleks.raiden@gmail.com)
Language / platform: PHP 5 (requires the mb_strings module)
The distribution kit includes a simple example script; the latest version is available online.
Below is the full source code, with comments and notes on the implementation of the algorithms described above.
class Lang_Auto_Detect
{
    // main variables
    // list of supported languages
    public $lang = array(
        'en' => array('English',   'http://en.wikipedia.org/wiki/English_language'),
        'ru' => array('Russian',   'http://ru.wikipedia.org/wiki/%D0%A0%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9_%D1%8F%D0%B7%D1%8B%D0%BA'),
        'ua' => array('Ukrainian', 'http://uk.wikipedia.org/wiki/%D0%A3%D0%BA%D1%80%D0%B0%D1%97%D0%BD%D1%81%D1%8C%D0%BA%D0%B0_%D0%BC%D0%BE%D0%B2%D0%B0')
    );
    // sensitivity threshold: what share of the language's characters must be present for it to be detected, in %
    public $detect_range = 75;
    // whether to handle multilingual documents and return an array of the languages used
    public $detect_multi_lang = false; // not yet implemented
    // return all results with their scores
    public $return_all_results = false; // in real use it is better to disable this
    // additionally use the system of rules and exceptions
    public $use_rules = false;
    // apply only the rules (much faster, but the result is less reliable; the more text, the more reliable it gets)
    public $use_rules_only = false;
    // priority of rules over statistics
    public $use_rules_priory = true; // true - rules take precedence over statistics, false - statistics over rules
    // match only the first rule or all of them?
    public $match_all_rules = false; // false - one match is enough, true - all rules must match
    // use the % of the alphabet or the total number of characters of each alphabet
    public $use_str_len_per_lang = true; // true - the total number of characters takes priority over the % of the alphabet, false - vice versa
    // minimum string length for detection
    public $min_str_len_detect = 50;
    // for reasonable performance, the maximum length in characters to compare
    public $max_str_len_detect = 1680;
    // internal table of the alphabets used for detection
    private $_langs = array(
        'en' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'),
        'ru' => array('а','б','в','г','д','е','ё','ж','з','и','й','к','л','м','н','о','п','р','с','т','у','ф','х','ц','ч','ш','щ','ъ','ы','ь','э','ю','я'),
        'ua' => array('а','б','в','г','ґ','д','е','є','ж','з','и','і','ї','й','к','л','м','н','о','п','р','с','т','у','ф','х','ц','ч','ш','щ','ь','ю','я')
    );
    // stores the rules
    // rules are characters or strings whose presence (any or all of them) directly identifies the text's language
    private $_lang_rules = array(
        'en' => array('th', 'ir'),
        'ru' => array('ы', 'ъ'), // letters that occur only in Russian
        'ua' => array('ї', 'є')
    );
    // class constructor
    public function __construct()
    {
        return true;
    }

    // preparation of the input string for analysis
    private function _prepare_str($tmp_str = null)
    {
        if ($tmp_str == null) return false; // if nothing is passed, exit
        $tmp_str = trim($tmp_str);
        $tmp_encoding = mb_detect_encoding($tmp_str);
        if (mb_strlen($tmp_str, $tmp_encoding) > $this->max_str_len_detect)
        {
            // cut the text down, for performance
            $tmp_str = mb_substr($tmp_str, 0, $this->max_str_len_detect, $tmp_encoding);
        }
        elseif (mb_strlen($tmp_str, $tmp_encoding) <= $this->min_str_len_detect)
        {
            return false;
        }
        // convert the encoding
        $tmp_str = mb_convert_encoding($tmp_str, 'UTF-8', $tmp_encoding);
        // convert everything to lower case
        $tmp_str = mb_strtolower($tmp_str, 'UTF-8');
        return $tmp_str;
    }
    // detect the language using the rules
    // a rule identifies the language unambiguously, but the rules do not guarantee a result :)
    private function _detect_from_rules($tmp_str = null)
    {
        if ($tmp_str == null) return false; // if nothing is passed, exit
        if (!is_array($this->_lang_rules)) return false;
        // go through all the rules
        foreach ($this->_lang_rules as $lang_code => $lang_rules)
        {
            $tmp_freq = 0;
            foreach ($lang_rules as $rule)
            {
                $tmp_term = mb_substr_count($tmp_str, $rule);
                if ($tmp_term > 0) // i.e. the sequence occurs at least once
                {
                    $tmp_freq++; // one more rule of this language matched
                }
                // now check
                if ($this->match_all_rules === true)
                {
                    // all rules of the language have to match
                    if ($tmp_freq == count($lang_rules)) return $lang_code;
                }
                else
                {
                    // a single match is enough
                    if ($tmp_freq > 0) return $lang_code;
                }
            }
        }
        return false;
    }
    // detect the language using the alphabet tables
    private function _detect_from_tables($tmp_str = null)
    {
        if ($tmp_str == null) return false; // if nothing is passed, exit
        // the string is expected to have been prepared for comparison already
        // go through all the languages and compute a score for each
        $lang_res = array();
        foreach ($this->lang as $lang_code => $lang_name)
        {
            $lang_res[$lang_code] = 0; // 0 by default, i.e. not this language
            $tmp_freq = 0;             // number of distinct letters of the current alphabet found
            $full_lang_symbols = 0;    // total number of characters of this language
            // the string length is arbitrary but the alphabet is fixed, so loop over the alphabet
            $cur_lang = $this->_langs[$lang_code];
            foreach ($cur_lang as $l_item)
            {
                // count the occurrences of this letter in the string
                $tmp_term = mb_substr_count($tmp_str, $l_item);
                if ($tmp_term > 0) // i.e. the letter occurs at least once
                {
                    $tmp_freq++; // one more letter of this alphabet found in the string
                    $full_lang_symbols += $tmp_term;
                }
            }
            if ($this->use_str_len_per_lang === true)
            {
                // use the total number of characters
                $lang_res[$lang_code] = $full_lang_symbols;
            }
            else
            {
                // calculate what percentage of the alphabet's letters was used
                $lang_res[$lang_code] = ceil((100 / count($cur_lang)) * $tmp_freq);
            }
        }
        // now let's see what we got
        arsort($lang_res, SORT_NUMERIC); // sort so that the most probable language comes first
        if ($this->return_all_results == true)
        {
            return $lang_res; // return all the results; otherwise pick the best one
        }
        else
        {
            // if the best score is above the threshold, return the language code, otherwise false (the language cannot be determined)
            $key = key($lang_res);
            if ($lang_res[$key] >= $this->detect_range)
                return $key;
            else
                return false;
        }
    }
    // the main language detection function
    public function lang_detect($tmp_str = null)
    {
        if ($tmp_str == null) return false; // if nothing is passed, exit
        $tmp_str = $this->_prepare_str($tmp_str);
        if ($tmp_str === false) return false;
        // if only the rules are used, skip the tables
        if ($this->use_rules_only === true)
        {
            $res = $this->_detect_from_rules($tmp_str);
            if ($res === false) return false;
            return array($res, $this->lang[$res]);
        }
        else
        {
            // when called from here we cannot return the full breakdown of results, so disable it
            $this->return_all_results = false;
            $res = $this->_detect_from_tables($tmp_str);
            if ($res === false) return false;
            if ($this->use_rules === true)
            {
                $res_rules = $this->_detect_from_rules($tmp_str);
                // proceed according to the priority setting of rules vs. statistics
                if ($this->use_rules_priory === true && $res_rules !== false)
                {
                    // rules take precedence over statistics
                    return array($res_rules, $this->lang[$res_rules]);
                }
                else
                {
                    return array($res, $this->lang[$res]);
                }
            }
            else
            {
                return array($res, $this->lang[$res]);
            }
        }
    }
}
PS Of course, the code does not claim to be perfect, and perhaps implementations of this functionality already exist on the net that I simply did not find. If you know of existing implementations, please let me know in the comments. The original article is posted on my blog.