Detecting text encoding in PHP, part 2: bigrams

The previous article implemented an algorithm that automatically detects a text's encoding from the frequency distribution of its characters. Commenters noted that using bigrams (or trigrams) would make the result more accurate. At the time I brushed this off, since single characters already gave a good result. Since then I have decided it would be nice to make the algorithm more reliable and accurate, especially as using bigrams instead of single characters costs very little.

Below the cut: an example implementation of the algorithm on bigrams, the source code, and the results it produces.

Algorithm Description

As before, we work only with single-byte Russian encodings. There is no point in writing such an algorithm to detect UTF-8, because checking for UTF-8 is very simple:

$str_utf8 = 'Русский текст'; // any Cyrillic string in UTF-8
$str_cp1251 = iconv('UTF-8', 'Windows-1251', $str_utf8);
var_dump(preg_match('#.#u', $str_utf8));
var_dump(preg_match('#.#u', $str_cp1251));

m00t@m00t:~/workspace/test$ php detect_encoding.php
int(1)
int(0)
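The same check can be wrapped in a small helper. This is only a sketch: the function name `isValidUtf8` is hypothetical and not part of the original sources. It compares the result strictly with 1, because depending on the PHP version `preg_match()` may return either `0` or `false` when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original sources).
function isValidUtf8(string $str): bool
{
    // With the /u modifier PCRE validates the whole subject as UTF-8
    // before matching, so even an empty pattern is enough.
    return preg_match('//u', $str) === 1;
}

$utf8   = "\xD0\x9F\xD1\x80"; // the letters "Пр" encoded as UTF-8
$cp1251 = "\xCF\xF0";         // the same two letters in Windows-1251

var_dump(isValidUtf8($utf8));
var_dump(isValidUtf8($cp1251));
```

The sample strings are written as byte escapes so that the result does not depend on the encoding of the script file itself.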

So, we take a fairly large Russian text and measure the frequencies of letter pairs instead of the frequencies of single letters (by tradition I took War and Peace). We get something like this (shown here are occurrence counts rather than frequencies, which amounts to the same thing):

<?php
// key: a pair of Russian letters, value: its number of occurrences in the text
return array (
  '' => 3,
  '' => 1127,
  '' => 5595,
  '' => 1373,
  '' => 3572,
  '' => 1483,
  '' => 0,
  '' => 1931,
  ....
  '' => 1325,
  '' => 2439,
  '' => 1,
  '' => 1,
  '' => 284,
  '' => 70,
  '' => 254,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 185,
  '' => 283,
);
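A table like this could be produced along the following lines. This is a minimal sketch, not the author's script: the function names are hypothetical. It assumes that the input has already been converted to Windows-1251, that only pairs of adjacent Russian letters are counted, and that case is folded to lowercase first.

```php
<?php
// Hypothetical helpers (not from the original sources).

// Fold an uppercase Windows-1251 Cyrillic byte to lowercase.
function toLowerCp1251(int $byte): int
{
    if ($byte >= 0xC0 && $byte <= 0xDF) {
        return $byte + 0x20; // А-Я (0xC0-0xDF) -> а-я (0xE0-0xFF)
    }
    if ($byte === 0xA8) {
        return 0xB8;         // Ё -> ё
    }
    return $byte;
}

// True for a lowercase Russian letter in Windows-1251.
function isCyrillicLetter(int $byte): bool
{
    return $byte >= 0xE0 || $byte === 0xB8;
}

// Count adjacent pairs of Russian letters in a Windows-1251 string.
function countBigrams(string $cp1251Text): array
{
    $counts = [];
    $prev = null; // previous letter, or null if the previous byte was not a letter
    $length = strlen($cp1251Text);
    for ($i = 0; $i < $length; $i++) {
        $byte = toLowerCp1251(ord($cp1251Text[$i]));
        if (!isCyrillicLetter($byte)) {
            $prev = null; // a non-letter breaks the pair
            continue;
        }
        if ($prev !== null) {
            $pair = chr($prev) . chr($byte);
            $counts[$pair] = ($counts[$pair] ?? 0) + 1;
        }
        $prev = $byte;
    }
    return $counts;
}

// "Мама ма" in Windows-1251: the space separates two words.
$counts = countBigrams("\xCC\xE0\xEC\xE0 \xEC\xE0");
var_dump($counts);
```

Unlike the table above, this sketch only records pairs that actually occur; pairs with a count of 0 would have to be added separately.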

Next, we convert this array into each of the required encodings, adding variants with different letter case along the way (and at the same time turning occurrence counts into frequencies):

<?php
return array (
  '' => 2.5816978277594E-6,
  '' => 2.5816978277594E-6,
  '' => 2.5816978277594E-6,
  '' => 2.5816978277594E-6,
  '' => 0.00096985781729497,
  '' => 0.00096985781729497,
  '' => 0.00096985781729497,
  '' => 0.00096985781729497,
  '' => 0.0048148664487714,
  '' => 0.0048148664487714,
  '' => 0.0048148664487714,
  '' => 0.0048148664487714,
  ...
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0,
  '' => 0.0001592046993785,
  '' => 0.0001592046993785,
  '' => 0.0001592046993785,
  '' => 0.0001592046993785,
  '' => 0.00024354016175197,
  '' => 0.00024354016175197,
  '' => 0.00024354016175197,
  '' => 0.00024354016175197,
);
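The conversion step might look roughly like this. Again this is a sketch under assumptions, not the author's generator: the function names are hypothetical, the counts are assumed to be keyed by lowercase two-byte Windows-1251 pairs, and a frequency is taken to be the count divided by the sum of all counts. Each pair is expanded into its four case variants, which all receive the same frequency, as in the listing above.

```php
<?php
// Hypothetical helpers (not from the original sources).

// Uppercase a single lowercase Windows-1251 Cyrillic byte.
function toUpperCp1251(string $char): string
{
    $byte = ord($char);
    if ($byte >= 0xE0) {
        return chr($byte - 0x20); // а-я (0xE0-0xFF) -> А-Я (0xC0-0xDF)
    }
    if ($byte === 0xB8) {
        return chr(0xA8);         // ё -> Ё
    }
    return $char;
}

// All four upper/lower combinations of a two-byte pair.
function caseVariants(string $pair): array
{
    $a = $pair[0];
    $b = $pair[1];
    $A = toUpperCp1251($a);
    $B = toUpperCp1251($b);
    return [$a . $b, $a . $B, $A . $b, $A . $B];
}

// Turn bigram counts into a "bigram => frequency" table for one encoding.
function buildSpectrum(array $counts, string $targetEncoding): array
{
    $total = array_sum($counts);
    if ($total <= 0) {
        return []; // nothing counted, avoid division by zero
    }
    $spectrum = [];
    foreach ($counts as $pair => $count) {
        $frequency = $count / $total;
        foreach (caseVariants((string) $pair) as $variant) {
            $converted = iconv('Windows-1251', $targetEncoding, $variant);
            if ($converted !== false) {
                $spectrum[$converted] = $frequency;
            }
        }
    }
    return $spectrum;
}

// Two sample pairs ("пр" and "ри" in Windows-1251) with made-up counts.
$spectrum = buildSpectrum(["\xEF\xF0" => 3, "\xF0\xE8" => 1], 'KOI8-R');
var_dump(count($spectrum)); // int(8): two pairs, four case variants each
```

The result could then be written to a file with `var_export()`, one file per target encoding.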

I ended up with three such spectrum files, one per encoding: cp1251, koi8-r, iso8859-5.

Now, when we need to find out the encoding of an unknown text, we walk through it, pick out every pair of adjacent characters, and for each encoding add that pair's frequency from the pre-generated spectrum to the encoding's "weight".
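The scoring loop can be sketched as follows. The function name `detectEncoding` and the shape of `$spectra` (encoding name mapped to a "bigram => frequency" array) are assumptions made for this illustration; the final normalization divides each weight by the sum over all encodings so that the results are comparable.

```php
<?php
// Hypothetical helper (not from the original sources).
// $spectra: encoding name => array of two-byte bigram => frequency.
function detectEncoding(string $data, array $spectra): array
{
    $weights = array_fill_keys(array_keys($spectra), 0.0);
    $length = strlen($data);

    // Slide a two-byte window over the raw bytes of the text.
    for ($i = 0; $i + 1 < $length; $i++) {
        $pair = $data[$i] . $data[$i + 1];
        foreach ($spectra as $encoding => $spectrum) {
            // Pairs missing from a spectrum contribute nothing.
            $weights[$encoding] += $spectrum[$pair] ?? 0.0;
        }
    }

    // Normalize so that the weights sum to 1.
    $sum = array_sum($weights);
    if ($sum > 0) {
        foreach ($weights as $encoding => $weight) {
            $weights[$encoding] = $weight / $sum;
        }
    }
    return $weights;
}

// Toy spectra with a single bigram each ("пр" in the respective encoding).
$spectra = [
    'windows-1251' => ["\xEF\xF0" => 0.75],
    'koi8-r'       => ["\xD0\xD2" => 0.75],
];

$weights = detectEncoding("\xEF\xF0\xEF\xF0", $spectra);
arsort($weights);                    // highest weight first
echo array_keys($weights)[0], "\n";  // prints "windows-1251"
```

With real data, `$spectra` would be loaded from the generated spectrum files rather than written inline.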

Results

The arrays in the comments are the total weights of the character pairs in the text for each encoding, divided by the sum over all encodings (see test_detect_encoding.php:21). In other words, they can be read as the probabilities that the text is in the given encoding.

$data = iconv('UTF-8', 'iso8859-5', '  ');
/*
array(3) {
  ["windows-1251"]=>
  float(0.071131587263965)
  ["koi8-r"]=>
  float(0.19145038318717)
  ["iso8859-5"]=>
  float(0.73741802954887)
}
*/

$data = iconv('UTF-8', 'windows-1251', '  ');
/*
array(3) {
  ["windows-1251"]=>
  float(0.95440659551352)
  ["koi8-r"]=>
  float(0.044353550201316)
  ["iso8859-5"]=>
  float(0.0012398542851665)
}
*/

$data = file_get_contents('test/cp1251_1.html'); // cp1251
/*
array(3) {
  ["windows-1251"]=>
  float(0.78542385878465)
  ["koi8-r"]=>
  float(0.18514302234077)
  ["iso8859-5"]=>
  float(0.029433118874583)
}
*/

As you can see, the result is very good: even on short strings the correct encoding leads the runner-up by a factor of roughly 4 to 20.

All the sources, together with the generated spectra for the three single-byte encodings (for Russian, of course), can be downloaded from GitHub. I deliberately did not package this as a reusable library, because there are only about a hundred lines of code. Anyone who needs it can wrap it up however they like and however their framework of choice dictates.

Source: https://habr.com/ru/post/127658/
