
PHP: Determining the language of text using N-grams. Part 2

This is the second part of an article by Jan Barber on detecting the language of a text with PHP. The first part can be found here.

The article had to be split into two parts because of the large amount of formatted text (“Some error ... We know ...”).

Unfortunately, my computer did not hold much material suitable for building models, but the multilingual OS X dictionaries turned out to be good enough for the purpose. After removing the XML tags with strip_tags, I had plain text.
<?php

$lang = new LangDetector();
$dir = "/Library/Dictionaries/Apple Dictionary.dictionary/Contents/Resources/";

// Each dictionary body is stripped of XML tags and added as training text for one language.
$dutch = strip_tags(file_get_contents($dir . "Dutch.lproj/Body.data"));
$lang->adddocument($dutch, 'dutch');
$english = strip_tags(file_get_contents($dir . "English.lproj/Body.data"));
$lang->adddocument($english, 'english');
$finnish = strip_tags(file_get_contents($dir . "fi.lproj/Body.data"));
$lang->adddocument($finnish, 'finnish');
$spanish = strip_tags(file_get_contents($dir . "Spanish.lproj/Body.data"));
$lang->adddocument($spanish, 'spanish');
$italian = strip_tags(file_get_contents($dir . "Italian.lproj/Body.data"));
$lang->adddocument($italian, 'italian');
$french = strip_tags(file_get_contents($dir . "French.lproj/Body.data"));
$lang->adddocument($french, 'french');
$swedish = strip_tags(file_get_contents($dir . "sv.lproj/Body.data"));
$lang->adddocument($swedish, 'swedish');
?>

FractalizeR's HabraSyntax Source Code Highlighter .

With the index built, we can now test a number of texts in various languages to check the recognition accuracy. Many thanks to Lorenzo, Soila (who speaks a huge number of different languages) and Ivo for the examples provided:

<?php

$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", $lang->detect($italian), "\n";

$finnish = "
Suomalainen on sellainen, joka vastaa kun ei kysytä,
kysyy kun ei vastata, ei vastaa kun kysytään,
sellainen, joka eksyy tieltä, huutaa rannalla
ja vastarannalla huutaa toinen samanlainen.
";
echo $finnish, "\n", "is ", $lang->detect($finnish), "\n";

$dutch = "
zoals het klokje thuis tikt, tikt het nergens
";
echo $dutch, "\n", "is ", $lang->detect($dutch), "\n";

$spanish = "
Por qué los inmensos aviones
No se pasean com sus hijos?
Cuál es el pájaro amarillo
Que llena el nido de limones?
Por qué no enseñan a sacar
Miel del sol a los helicópteros?
";
echo $spanish, "\n", "is ", $lang->detect($spanish), "\n";

$swedish = "
Och knyttet tog av skorna och suckade och sa:
hur kan det kännas sorgesamt fast allting är så bra?
Men vem ska trösta knyttet med att säga: lilla vän,
vad gör man med en snäcka om man ej får visa den?
";
echo $swedish, "\n", "is ", $lang->detect($swedish), "\n";
?>

As you can see (the script output is slightly trimmed for brevity), each language was identified correctly:

Nel mezzo del cammin...
is italian

Suomalainen on sellainen...
is finnish

zoals het klokje thuis tikt, tikt het nergens
is dutch

Por que los inmensos...
is spanish

Och knyttet tog av...
is swedish


The same can be done for whole web pages by removing the HTML tags with strip_tags. The targets here were the sites of three local Ibuildings offices:

<?php

// A URL scheme is required; without it file_get_contents() looks for a local file.
$nl = strip_tags(file_get_contents('http://www.ibuildings.nl'));
echo "IB NL reads as: ", $lang->detect($nl), "\n";

$uk = strip_tags(file_get_contents('http://www.ibuildings.co.uk'));
echo "IB UK reads as: ", $lang->detect($uk), "\n";

$it = strip_tags(file_get_contents('http://www.ibuildings.it'));
echo "IB IT reads as: ", $lang->detect($it), "\n";
?>


It seems that the page on the NL domain still contains more English text than Dutch, so it was identified as English. The Italian site, however, was recognized correctly:

IB NL reads as: english
IB UK reads as: english
IB IT reads as: italian
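
One thing that can skew page-level results is that strip_tags() removes only the tags themselves and leaves the contents of script and style elements in the text. Whether that played a role in the result above is not known. The following is a minimal sketch of a pre-cleaning step; the helper name visibleText is hypothetical and not part of the original article, and UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: reduce an HTML page to roughly the text a visitor reads.
// Assumes the page is encoded as UTF-8.
function visibleText($html)
{
    // Drop <script> and <style> elements together with their contents,
    // because strip_tags() would keep the code between the tags.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Remove the remaining tags.
    $text = strip_tags($html);

    // Turn entities such as &amp; or &eacute; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

The cleaned string can then be passed to $lang->detect() in place of the raw strip_tags() output.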


Other methods


Although the trigram method is convenient and simple, it is not necessarily the best choice in every situation. For example, if you need a method that works without prior training or with minimal memory, you can simply compile a list of short, frequently occurring words for each language (such as articles and prepositions) and look only for those in a given text.
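
A minimal sketch of the word-list idea might look like this. The function name detectByStopWords and the word lists are illustrative assumptions, not taken from the original article; a practical list would be longer and based on real frequency data.

```php
<?php
// Hypothetical word lists: a handful of very common short words per language.
$stopWords = array(
    'english' => array('the', 'and', 'of', 'to', 'is', 'in', 'that', 'it'),
    'dutch'   => array('de', 'het', 'een', 'en', 'van', 'is', 'dat', 'niet'),
    'italian' => array('il', 'la', 'di', 'che', 'e', 'per', 'una', 'del'),
    'spanish' => array('el', 'la', 'de', 'que', 'y', 'los', 'se', 'por'),
);

// Returns the language whose list matches the most words, or null if nothing matches.
function detectByStopWords($text, array $stopWords)
{
    // Split on anything that is not a letter; the "u" flag makes the pattern UTF-8 aware.
    // strtolower() is enough here because the lists above contain only ASCII words.
    $words = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $scores = array();
    foreach ($stopWords as $language => $list) {
        // array_intersect() keeps every occurrence in $words that appears in $list.
        $scores[$language] = count(array_intersect($words, $list));
    }

    arsort($scores);          // highest count first
    $best = key($scores);
    return $scores[$best] > 0 ? $best : null;
}
```

Called as detectByStopWords($dutch, $stopWords), it counts hits per language and picks the largest count. Words shared between lists (such as "is" or "la") simply add to several scores.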

Similarly, searching for Unicode characters that are unique to a given language can identify it with sufficient accuracy.
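
The character-based idea can be sketched as follows. The marker table and the function name detectByMarkers are hypothetical; the characters chosen are distinctive only within this small set of candidate languages (several of them also occur in languages not listed), and both the source file and the input are assumed to be UTF-8.

```php
<?php
// Hypothetical marker characters, distinctive only among these candidates.
$markers = array(
    'spanish' => array('ñ', '¿', '¡'),
    'swedish' => array('å'),
    'french'  => array('ç', 'œ'),
);

// Returns the language with the most marker characters, or null if none occur.
function detectByMarkers($text, array $markers)
{
    $scores = array();
    foreach ($markers as $language => $chars) {
        $scores[$language] = 0;
        foreach ($chars as $char) {
            // substr_count() compares bytes, which is safe for UTF-8:
            // the bytes of one character never occur inside another character.
            $scores[$language] += substr_count($text, $char);
        }
    }

    arsort($scores);          // highest count first
    $best = key($scores);
    return $scores[$best] > 0 ? $best : null;
}
```

Only lowercase markers are listed, so an uppercase "Ñ" or "Å" would go unnoticed in this sketch. A text without any marker characters yields null, which makes this approach more useful as a quick first check than as a full detector.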

PEAR: Text_LanguageDetect

When Lorenzo and I discussed this problem, he mentioned that a package for detecting the language of a text is already available in the PEAR library, albeit in an alpha version. It also uses the trigram method but offers somewhat richer features. As expected, it is fairly simple to use, supports Unicode, and ships with a ready-made trigram database for some languages, so it needs little training. For completeness, we used it to detect the languages of the same text fragments:

<?php

require_once 'Text/LanguageDetect.php';

// Returns the most likely language name, or the PEAR error message on failure.
function detect($text, $l) {
    $result = $l->detect($text, 1);
    if (PEAR::isError($result)) {
        return $result->getMessage();
    } else {
        return key($result);
    }
}

$l = new Text_LanguageDetect();

$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", detect($italian, $l), "\n";

// ...
?>


As expected, the results were similar. The package can be installed from PEAR with the command:

pear -d preferred_state=alpha install Text_LanguageDetect

Source: https://habr.com/ru/post/75512/

