
PHP: Determining the language of text using N-grams. Part 1

Note: for some reason I could not restore my translation, the one for which I received an invite; it disappeared somewhere, so I am publishing it again.

Usually, when we look at a text, we break it down into words and use those words to determine the language in which it is written. However, the language can also be determined by comparing other units of text, for example letter n-grams.

N-grams are simply n-letter sequences extracted from a document. For example, the word “constable” decomposed into trigrams (three-letter sequences) looks like this: {“con”, “ons”, “nst”, “sta”, “tab”, “abl”, “ble”}. There are many ways to extract such sequences; a fairly obvious one is given below. This function extracts n-grams from the input string. By default, trigrams are extracted.


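<?php

// Extract all n-grams (trigrams by default) from a single word.
// Note: strlen() and string offsets work on bytes, not characters.
function getNgrams($word, $n = 3) {
    $ngrams = array();
    $length = strlen($word);
    for ($i = 0; $i < $length; $i++) {
        // Start emitting once at least $n characters have been read.
        if ($i > ($n - 2)) {
            $ng = '';
            // Collect the $n characters that end at position $i.
            for ($j = $n - 1; $j >= 0; $j--) {
                $ng .= $word[$i - $j];
            }
            $ngrams[] = $ng;
        }
    }
    return $ngrams;
}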
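
Since there are many ways to extract such sequences, here is one more for comparison: a minimal sketch that slides a window over the word with `substr()`. The name `getNgramsSubstr` is made up for this illustration and is not part of the original article; like the function above, it works on bytes, so it assumes single-byte text.

```php
<?php

// Hypothetical alternative extractor: slide a window of length $n over the word.
// Assumes $n >= 1 and a single-byte encoding (substr() counts bytes).
function getNgramsSubstr($word, $n = 3) {
    $ngrams = array();
    $last = strlen($word) - $n; // index of the last valid window start
    for ($i = 0; $i <= $last; $i++) {
        $ngrams[] = substr($word, $i, $n);
    }
    return $ngrams;
}

echo implode(' ', getNgramsSubstr('constable')), "\n"; // con ons nst sta tab abl ble
echo implode(' ', getNgramsSubstr('to')), "\n";        // empty line: word shorter than 3 letters
```

A word shorter than `$n` yields an empty array, because the loop never runs.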


Language detection


Looking at a text split into n-grams, you can see that they make it fairly easy to determine the language in which it is written. Many algorithms exist for this, using bigrams or trigrams and computing different "similarity" coefficients, but they all agree on one thing: first build a statistical model of the n-gram distribution for each language, then see which of these models most closely matches the given text. In our example we will use trigrams and a vector-space-style cosine similarity measure.
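
To make the similarity measure concrete, here is a minimal sketch of the textbook cosine similarity between two trigram-count vectors. The function name `cosineSimilarity` and the sample counts are invented for illustration; this is not the scoring code of the class presented in this article, only the general formula (dot product divided by the product of the vector lengths).

```php
<?php

// Hypothetical helper: cosine similarity of two sparse vectors,
// each given as array(trigram => count).
function cosineSimilarity(array $a, array $b) {
    $dot = 0.0;
    foreach ($a as $trigram => $count) {
        if (isset($b[$trigram])) {
            $dot += $count * $b[$trigram]; // only shared trigrams contribute
        }
    }
    $normA = 0.0;
    foreach ($a as $count) {
        $normA += $count * $count;
    }
    $normB = 0.0;
    foreach ($b as $count) {
        $normB += $count * $count;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty vector is treated as similar to nothing
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Made-up counts, for illustration only.
$english = array('the' => 5, 'ing' => 3, 'and' => 2);
$german  = array('sch' => 4, 'ein' => 3, 'und' => 2, 'the' => 1);
$sample  = array('the' => 2, 'ing' => 1);

echo cosineSimilarity($sample, $english), "\n";
echo cosineSimilarity($sample, $german), "\n";
```

With these invented numbers the sample scores higher against the English vector than against the German one, because it shares more of its trigrams with the former.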

One of the most obvious questions is what to do with spaces. In our example we ignore them and generate trigrams only from within words. We also ignore words shorter than three letters. We take into account only the frequency with which a trigram occurs in the text. In principle, detection accuracy could be improved by adding to the algorithm, for example, a global weight for each trigram that shows how common it is across all languages of our model. But even without this, the trigram method works fairly accurately, especially with a small number of languages.
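
The article does not define the global weight, so the following is only one possible reading, borrowed from inverse document frequency: the weight of a trigram is the logarithm of the number of languages divided by the number of languages in which the trigram occurs. The function name `globalWeights`, the shape of `$index` (trigram => language => count) and the sample numbers are all assumptions made for this sketch.

```php
<?php

// Hypothetical global weight in the spirit of inverse document frequency.
// $index has the shape array(trigram => array(language => count)).
function globalWeights(array $index, $languageCount) {
    $weights = array();
    foreach ($index as $trigram => $perLanguage) {
        // count($perLanguage) is the number of languages containing the trigram.
        $weights[$trigram] = log($languageCount / count($perLanguage));
    }
    return $weights;
}

// Made-up counts for three languages, for illustration only.
$index = array(
    'the' => array('en' => 40, 'de' => 2, 'fr' => 1),
    'sch' => array('de' => 25),
    'ent' => array('en' => 10, 'fr' => 30),
);
$weights = globalWeights($index, 3);

foreach ($weights as $trigram => $weight) {
    echo $trigram, ': ', round($weight, 3), "\n";
}
```

Under this definition a trigram found in every language gets weight 0 and so carries no evidence, while a trigram seen in a single language gets the largest weight. A detector could multiply each trigram's contribution to the score by its weight.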

Below is a small class that implements detection. The key methods are addDocument, which breaks the input document into trigrams and stores in an internal dictionary how often each trigram occurs in each language (this is how our model is trained; translator's note), and detect, which breaks the incoming text into trigrams in the same way and, for each trigram, checks how often it occurs in each language of our model.

<?php

class LangDetector {
    // trigram => array(language => number of occurrences in training text)
    private $index = array();
    // language => total number of trigrams seen in training text
    private $languages = array();

    // Train the model: count the trigrams of $document under $language.
    public function addDocument($document, $language) {
        if (!isset($this->languages[$language])) {
            $this->languages[$language] = 0;
        }

        $words = $this->getWords($document);
        foreach ($words as $match) {
            $trigrams = $this->getNgrams($match);
            foreach ($trigrams as $trigram) {
                if (!isset($this->index[$trigram])) {
                    $this->index[$trigram] = array();
                }
                if (!isset($this->index[$trigram][$language])) {
                    $this->index[$trigram][$language] = 0;
                }
                $this->index[$trigram][$language]++;
            }
            $this->languages[$language] += count($trigrams);
        }
    }

    // Return the best-matching language, or null if no trigram is known.
    public function detect($document) {
        // Count the trigrams of the incoming text.
        $words = $this->getWords($document);
        $trigrams = array();
        foreach ($words as $word) {
            foreach ($this->getNgrams($word) as $trigram) {
                if (!isset($trigrams[$trigram])) {
                    $trigrams[$trigram] = 0;
                }
                $trigrams[$trigram]++;
            }
        }
        $total = array_sum($trigrams);

        $scores = array();
        foreach ($trigrams as $trigram => $count) {
            if (!isset($this->index[$trigram])) {
                continue; // trigram never seen during training
            }
            foreach ($this->index[$trigram] as $language => $lCount) {
                if (!isset($scores[$language])) {
                    $scores[$language] = 0;
                }
                // Relative frequency in the language times relative
                // frequency in the document, summed over shared trigrams.
                $score = ($lCount / $this->languages[$language])
                    * ($count / $total);
                $scores[$language] += $score;
            }
        }
        // Sort by score, highest first, and return the top language.
        arsort($scores);
        return key($scores);
    }

    // Lowercase the text and split it into runs of word characters.
    private function getWords($document) {
        $document = strtolower($document);
        preg_match_all('/\w+/', $document, $matches);
        return $matches[0];
    }

    // Extract n-grams from one word; words shorter than $n yield none.
    private function getNgrams($match, $n = 3) {
        $ngrams = array();
        $length = strlen($match);
        for ($i = 0; $i < $length; $i++) {
            if ($i > ($n - 2)) {
                $ng = '';
                for ($j = $n - 1; $j >= 0; $j--) {
                    $ng .= $match[$i - $j];
                }
                $ngrams[] = $ng;
            }
        }
        return $ngrams;
    }
}
?>


Continuation of the article here.

Source: https://habr.com/ru/post/75509/

