PHP: Determining the language of text using N-grams. Part 2

The second part of an article by Ian Barber on detecting the language of a text with PHP. The first part can be found here.

It had to be split into two parts because of the large amount of formatted text (“Some error ... We know ...”).

Unfortunately, I did not have much material on my computer for building the models, but the multilingual OS X dictionaries turned out to be suitable for the purpose. After removing the XML tags with strip_tags, I was left with plain text.

<?php
$lang = new LangDetector();
$dir = "/Library/Dictionaries/Apple Dictionary.dictionary/Contents/Resources/";

$dutch = strip_tags(file_get_contents($dir . "Dutch.lproj/Body.data"));
$lang->adddocument($dutch, 'dutch');

$english = strip_tags(file_get_contents($dir . "English.lproj/Body.data"));
$lang->adddocument($english, 'english');

$finnish = strip_tags(file_get_contents($dir . "fi.lproj/Body.data"));
$lang->adddocument($finnish, 'finnish');

$spanish = strip_tags(file_get_contents($dir . "Spanish.lproj/Body.data"));
$lang->adddocument($spanish, 'spanish');

$italian = strip_tags(file_get_contents($dir . "Italian.lproj/Body.data"));
$lang->adddocument($italian, 'italian');

$french = strip_tags(file_get_contents($dir . "French.lproj/Body.data"));
$lang->adddocument($french, 'french');

$swedish = strip_tags(file_get_contents($dir . "sv.lproj/Body.data"));
$lang->adddocument($swedish, 'swedish');
?>
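One caveat with the strip_tags step: it removes tags without putting anything in their place, so the text of adjacent elements runs together and can produce trigrams that belong to no language. Below is a minimal sketch of one possible workaround, assuming the markup is simple enough that padding every `<` with a space does no harm. The helper name plainText and the sample string are made up for illustration and are not part of the original article.

```php
<?php
// Hypothetical helper: strip markup but keep word boundaries between elements.
function plainText(string $markup): string
{
    // Put a space in front of every tag so neighbouring text nodes stay separated.
    $spaced = str_replace('<', ' <', $markup);
    // Remove the tags, then collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', strip_tags($spaced)));
}

// Made-up sample entry, only for illustration.
$sample = '<entry><h1>huis</h1><p>gebouw om in te wonen</p></entry>';

echo strip_tags($sample), "\n"; // the heading and the paragraph run together
echo plainText($sample), "\n";  // the words stay separated
```

Whether this matters in practice depends on how the dictionary data is laid out; with long text nodes and few tags the effect on the trigram counts is probably small.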

With the index built, we can now test it on a number of texts in various languages to check how accurate the recognition is. Many thanks to Lorenzo, Soila (who speaks a huge number of languages) and Ivo for the examples they provided:

<?php
$italian = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura ché la diritta via era smarrita.";
echo $italian, "\n", "is ", $lang->detect($italian), "\n";

$finnish = "Suomalainen on sellainen, joka vastaa kun ei kysytä, kysyy kun ei vastata, ei vastaa kun kysytään, sellainen, joka eksyy tieltä, huutaa rannalla ja vastarannalla huutaa toinen samanlainen.";
echo $finnish, "\n", "is ", $lang->detect($finnish), "\n";

$dutch = "zoals het klokje thuis tikt, tikt het nergens";
echo $dutch, "\n", "is ", $lang->detect($dutch), "\n";

$spanish = "Por qué los inmensos aviones No se pasean com sus hijos? Cuál es el pájaro amarillo Que llena el nido de limones? Por qué no enseñan a sacar Miel del sol a los helicópteros?";
echo $spanish, "\n", "is ", $lang->detect($spanish), "\n";

$swedish = "Och knyttet tog av skorna och suckade och sa: hur kan det kännas sorgesamt fast allting är så bra? Men vem ska trösta knyttet med att säga: lilla vän, vad gör man med en snäcka om man ej får visa den?";
echo $swedish, "\n", "is ", $lang->detect($swedish), "\n";
?>

As you can see (the script output is trimmed slightly for brevity), every language was identified correctly:

Nel mezzo del cammin...
is italian

Suomalainen on sellainen...
is finnish

zoals het klokje thuis tikt, tikt het nergens
is dutch

Por que los inmensos...
is spanish

Och knyttet tog av...
is swedish

The same can be done for whole websites by removing the HTML tags with strip_tags. The test subjects were the sites of three regional Ibuildings offices:

<?php
// The http:// scheme is needed: without it file_get_contents() looks for a local file.
// Fetching URLs this way also requires allow_url_fopen to be enabled.
$nl = strip_tags(file_get_contents('http://www.ibuildings.nl'));
echo "IB NL reads as: " . $lang->detect($nl), "\n";

$uk = strip_tags(file_get_contents('http://www.ibuildings.co.uk'));
echo "IB UK reads as: " . $lang->detect($uk), "\n";

$it = strip_tags(file_get_contents('http://www.ibuildings.it'));
echo "IB IT reads as: " . $lang->detect($it), "\n";
?>

It seems the page on the .nl domain still contains more English text than Dutch, so it was detected as English. With Italian, however, everything went fine:

IB NL reads as: english
IB UK reads as: english
IB IT reads as: italian

Other methods

Although the trigram method is convenient and simple, it is not necessarily the best choice in every situation. For example, if you need a method that works without prior training or with minimal memory, you can simply compile a list of short, frequently occurring words for each language (such as articles and prepositions) and look only for those in a given text.
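The common-word approach might look roughly like the sketch below. The word lists, the variable name $commonWords and the function name detectByCommonWords are illustrative assumptions rather than anything from the original article; real lists would be longer and taken from frequency data.

```php
<?php
// Hypothetical word lists: a few very common short words per language.
$commonWords = [
    'english' => ['the', 'and', 'of', 'to', 'in', 'is', 'that', 'it'],
    'dutch'   => ['de', 'het', 'een', 'en', 'van', 'is', 'dat', 'niet'],
    'italian' => ['il', 'la', 'di', 'che', 'e', 'per', 'una', 'del'],
    'spanish' => ['el', 'la', 'de', 'que', 'y', 'los', 'por', 'es'],
];

// Returns the language whose common words occur most often, or null if none occur.
function detectByCommonWords(string $text, array $commonWords): ?string
{
    // The lists above are plain ASCII, so strtolower() is sufficient here.
    // Split on every run of non-letter characters (Unicode-aware because of /u).
    $words = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $best = null;
    $bestScore = 0;
    foreach ($commonWords as $language => $list) {
        $lookup = array_flip($list);
        $score = 0;
        foreach ($words as $word) {
            if (isset($lookup[$word])) {
                $score++;
            }
        }
        // Strictly greater: on a tie the language listed first wins.
        if ($score > $bestScore) {
            $best = $language;
            $bestScore = $score;
        }
    }
    return $best;
}

echo detectByCommonWords("zoals het klokje thuis tikt, tikt het nergens", $commonWords), "\n"; // prints "dutch"
```

Words shared between languages (such as "is" or "la") add to several scores at once, so short inputs can easily end in a tie; this is the price of skipping the training step.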

Similarly, looking for Unicode characters that are unique to a given language can identify it with sufficient accuracy.
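A character-based check could be sketched as follows. The marker table is an assumption made for illustration: these characters are distinctive only within a small candidate set (for example, å also occurs in Danish and Norwegian), and the names $markers and detectByMarkerCharacters are hypothetical.

```php
<?php
// Illustrative marker characters, distinctive only among these three candidates.
// Upper- and lowercase forms are listed explicitly to avoid relying on case folding.
$markers = [
    'spanish' => 'ñÑ¿¡',
    'swedish' => 'åÅ',
    'french'  => 'çÇœŒ',
];

// Returns the language whose marker characters occur most often, or null if none occur.
function detectByMarkerCharacters(string $text, array $markers): ?string
{
    $best = null;
    $bestCount = 0;
    foreach ($markers as $language => $chars) {
        // Build a character class; /u makes PCRE treat the text as UTF-8.
        $pattern = '/[' . preg_quote($chars, '/') . ']/u';
        $count = preg_match_all($pattern, $text);
        if ($count > $bestCount) {
            $best = $language;
            $bestCount = $count;
        }
    }
    return $best;
}

echo detectByMarkerCharacters("Por qué no enseñan a sacar miel del sol a los helicópteros?", $markers), "\n"; // prints "spanish"
```

A text that happens to contain none of the marker characters returns null, so a check like this is probably best used as a quick first pass before a fuller method.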

PEAR: Text_LanguageDetect

When Lorenzo and I discussed this problem, he mentioned that PEAR already includes a package for detecting the language of a text, albeit in an alpha version. It also uses the trigram method, but offers somewhat richer functionality. As expected, it is fairly simple to use, supports Unicode, and ships with a ready-made trigram database for a number of languages, so it needs little training. For completeness, we used it to detect the languages of the same text fragments:

<?php
require_once 'Text/LanguageDetect.php';

// Return the top-ranked language name, or the PEAR error message if detection fails.
function detect($text, $l)
{
    // Ask for a single result; the language name is the key of the returned array.
    $result = $l->detect($text, 1);
    if (PEAR::isError($result)) {
        return $result->getMessage();
    } else {
        return key($result);
    }
}

$l = new Text_LanguageDetect();

$italian = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura ché la diritta via era smarrita.";
echo $italian, "\n", "is ", detect($italian, $l), "\n";
// ...
?>

As expected, the result was similar. The package can be installed from PEAR with the command:

pear -d preferred_state=alpha install Text_LanguageDetect

Source: https://habr.com/ru/post/75512/
