
Detecting text encoding in PHP: an overview of existing solutions, plus one more reinvented wheel

I was faced with a task: auto-detect the encoding of a page, a text, or anything else. The task is not new, and plenty of wheels have already been reinvented for it. This article gives a short review of what I found on the net, plus my own solution, which seems to me a worthy one.

1. Why not mb_detect_encoding()?


In short: it doesn't work.

Let's take a look:
    // Russian text converted to CP1251
    // ('...' stands for a Russian sentence; the Cyrillic literal is not reproduced here)
    $string = iconv('UTF-8', 'Windows-1251', '...');

    // Let's see what mb_detect_encoding() returns. First with $strict = FALSE
    var_dump(mb_detect_encoding($string, array('UTF-8'))); // UTF-8
    var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251'))); // Windows-1251
    var_dump(mb_detect_encoding($string, array('UTF-8', 'KOI8-R'))); // KOI8-R
    var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R'))); // FALSE
    var_dump(mb_detect_encoding($string, array('UTF-8', 'ISO-8859-5'))); // ISO-8859-5
    var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R', 'ISO-8859-5'))); // ISO-8859-5

    // Now with $strict = TRUE
    var_dump(mb_detect_encoding($string, array('UTF-8'), TRUE)); // FALSE
    var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251'), TRUE)); // FALSE
    var_dump(mb_detect_encoding($string, array('UTF-8', 'KOI8-R'), TRUE)); // FALSE
    var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R'), TRUE)); // FALSE
    var_dump(mb_detect_encoding($string, array('UTF-8', 'ISO-8859-5'), TRUE)); // ISO-8859-5
    var_dump(mb_detect_encoding($string, array('UTF-8', 'Windows-1251', 'KOI8-R', 'ISO-8859-5'), TRUE)); // ISO-8859-5

As you can see, the output is a complete mess. What do we do when it is unclear why a function behaves this way? Right, we google it. I found a wonderful answer.

To finally dispel all hope of using mb_detect_encoding(), we need to dig into the sources of the mbstring extension. So, sleeves rolled up, let's go:
    // ext/mbstring/mbstring.c:2629
    PHP_FUNCTION(mb_detect_encoding)
    {
        ...
        // line 2703
        ret = mbfl_identify_encoding_name(&string, elist, size, strict);
        ...

Ctrl + click:
    // ext/mbstring/libmbfl/mbfl/mbfilter.c:643
    const char*
    mbfl_identify_encoding_name(mbfl_string *string, enum mbfl_no_encoding *elist, int elistsz, int strict)
    {
        const mbfl_encoding *encoding;

        encoding = mbfl_identify_encoding(string, elist, elistsz, strict);
        ...

Ctrl + click:
    // ext/mbstring/libmbfl/mbfl/mbfilter.c:557
    /*
     * identify encoding
     */
    const mbfl_encoding *
    mbfl_identify_encoding(mbfl_string *string, enum mbfl_no_encoding *elist, int elistsz, int strict)
    {
        ...

I won't post the full text of the function, so as not to clutter the article with unnecessary source code. Anyone interested can look for themselves. What we care about is line 593, where the actual check is made to see whether a character fits the encoding:
    // ext/mbstring/libmbfl/mbfl/mbfilter.c:593
    (*filter->filter_function)(*p, filter);
    if (filter->flag) {
        bad++;
    }

Here are the basic filters for single-byte Cyrillic:
Windows-1251 (original comments preserved)
    // ext/mbstring/libmbfl/filters/mbfilter_cp1251.c:142
    /* all of this is so ugly now! */
    static int mbfl_filt_ident_cp1251(int c, mbfl_identify_filter *filter)
    {
        if (c >= 0x80 && c < 0xff)
            filter->flag = 0;
        else
            filter->flag = 1; /* not it */
        return c;
    }


KOI8-R
    // ext/mbstring/libmbfl/filters/mbfilter_koi8r.c:142
    static int mbfl_filt_ident_koi8r(int c, mbfl_identify_filter *filter)
    {
        if (c >= 0x80 && c < 0xff)
            filter->flag = 0;
        else
            filter->flag = 1; /* not it */
        return c;
    }


ISO-8859-5 (this is where it gets really fun)
    // ext/mbstring/libmbfl/mbfl/mbfl_ident.c:248
    int mbfl_filt_ident_true(int c, mbfl_identify_filter *filter)
    {
        return c;
    }

As you can see, the ISO-8859-5 filter always reports TRUE (to report FALSE, it would have to set filter->flag = 1).

Once you look at the filters, everything falls into place. CP1251 cannot be distinguished from KOI8-R at all: both filters accept exactly the same bytes. And ISO-8859-5, if it is in the list of encodings, will always be detected as a valid match.
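To make that concrete, here is a minimal PHP sketch that mimics the three identify filters quoted above and counts how many bytes each one flags. The function names (`ident_cp1251`, `ident_koi8r`, `ident_true`, `count_bad`) and the sample bytes are my own illustration, not part of mbstring.

```php
<?php
// Hypothetical PHP ports of the three identify filters quoted above.
// Each returns true when the byte is flagged as "not it" (filter->flag = 1).
function ident_cp1251(int $c): bool { return !($c >= 0x80 && $c < 0xff); }
function ident_koi8r(int $c): bool  { return !($c >= 0x80 && $c < 0xff); }
function ident_true(int $c): bool   { return false; } // never flags anything

// Counts flagged bytes, like the bad++ counter in mbfl_identify_encoding
function count_bad(string $bytes, callable $filter): int
{
    $bad = 0;
    for ($i = 0, $n = strlen($bytes); $i < $n; ++$i) {
        if ($filter(ord($bytes[$i]))) {
            ++$bad;
        }
    }
    return $bad;
}

// A Russian word as CP1251 bytes (hex escapes keep this file encoding-independent)
$sample = "\xCF\xF0\xE8\xE2\xE5\xF2";

$filters = [
    'Windows-1251' => 'ident_cp1251',
    'KOI8-R'       => 'ident_koi8r',
    'ISO-8859-5'   => 'ident_true',
];

$bad = [];
foreach ($filters as $name => $filter) {
    $bad[$name] = count_bad($sample, $filter);
}
print_r($bad); // the CP1251 and KOI8-R counts can never differ
```

Since the CP1251 and KOI8-R checks are byte-for-byte identical, no input can ever separate the two candidates, and the ISO-8859-5 check accepts everything.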

All in all, a fail. That is understandable: in the general case it is impossible to determine the encoding from character codes alone, since the same codes are used by different encodings.

2. What does Google turn up?


Google turns up all sorts of squalor. I won't even post the source code here; see for yourself if you want (remove the space after http://; I don't know how to show a link as plain text):

http: // deer.org.ua/2009/10/06/1/
http: // php.su/forum/topic.php?forum=1&topic=1346

3. Searching Habr


1) Character codes again: habrahabr.ru/blogs/php/27378/#comment_710532

2) A very interesting solution, in my opinion: habrahabr.ru/blogs/php/27378/#comment_1399654
The pros and cons are in the comments at the link. Personally, I think this solution is overkill if all you need is encoding detection; it is far too powerful. Encoding detection comes out of it as a side effect.

4. My own solution


The idea came to me while reading the second link from the previous section. It goes like this: take a large Russian text, measure the frequencies of the different letters, and use those frequencies to detect the encoding. Looking ahead, I'll say right away that uppercase and lowercase letters will cause problems. So I am posting the letter frequencies (let's call them a "spectrum") in two variants, case-sensitive and case-insensitive (in the second variant, for each lowercase letter I added its uppercase counterpart with the same frequency and dropped the uppercase letters' own frequencies). The space and all letters with a frequency below 0.001 are cut out of these "spectra". Here is what I got after processing "War and Peace":

Case-sensitive "spectrum":
    array (
        '' => 0.095249209893009,
        '' => 0.06836817536026,
        '' => 0.067481298384992,
        '' => 0.055995027400041,
        '' => 0.052242744063325,
        ....
        '' => 0.002252892226507,
        '' => 0.0021318391371162,
        '' => 0.0018574762967903,
        '' => 0.0015961610948418,
        '' => 0.0014044332975731,
        '' => 0.0013188987793209,
        '' => 0.0012623590130186,
        '' => 0.0011804488387602,
        '' => 0.001061932790165,
    )


Case-insensitive:
    array (
        '' => 0.095249209893009,
        '' => 0.095249209893009,
        '' => 0.06836817536026,
        '' => 0.06836817536026,
        '' => 0.067481298384992,
        '' => 0.067481298384992,
        '' => 0.055995027400041,
        '' => 0.055995027400041,
        ....
        '' => 0.0029893589260344,
        '' => 0.0029893589260344,
        '' => 0.0024649163501406,
        '' => 0.0024649163501406,
        '' => 0.002252892226507,
        '' => 0.002252892226507,
        '' => 0.0015961610948418,
        '' => 0.0015961610948418,
    )
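Measuring such a "spectrum" could look roughly like the sketch below. The helper name `build_spectrum` is hypothetical, and two things are my assumptions rather than something stated in the article: frequencies are taken relative to the total number of bytes in the text, and every byte except the space is counted (not only letters). The input is expected to be in a single-byte encoding.

```php
<?php
// Hypothetical helper: builds a byte-frequency "spectrum" for a text in a
// single-byte encoding. Assumption: frequencies are relative to the total
// number of bytes; the article does not say what it divides by.
function build_spectrum(string $text, float $min_freq = 0.001): array
{
    $total = strlen($text);
    if ($total === 0) {
        return [];
    }

    $spectrum = [];
    // count_chars(..., 1) returns byte value => count, only for bytes that occur
    foreach (count_chars($text, 1) as $code => $count) {
        $freq = $count / $total;
        if ($code === ord(' ') || $freq < $min_freq) {
            continue; // drop the space and rare bytes
        }
        $spectrum[$code] = $freq;
    }

    arsort($spectrum); // most frequent first, keys preserved
    return $spectrum;
}

// Tiny ASCII demo; a real run would feed in a large Russian text
$spectrum = build_spectrum('aab b');
print_r($spectrum);
```

In the demo both `a` and `b` occur twice in five bytes, so each gets a frequency of 0.4, and the space is dropped.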


Spectra in different encodings (the array keys are the codes of the corresponding characters in the given encoding):

Windows-1251: case-sensitive, case-insensitive
KOI8-R: case-sensitive, case-insensitive
ISO-8859-5: case-sensitive, case-insensitive
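Re-keying one spectrum for several encodings can be sketched with iconv, assuming the spectrum is first stored with UTF-8 letters as keys. The helper name `spectrum_for_encoding` is hypothetical, and the two frequencies below are illustrative values, not the measured ones.

```php
<?php
// Hypothetical helper: re-keys a spectrum of UTF-8 letters by the byte code
// each letter has in a single-byte target encoding.
function spectrum_for_encoding(array $utf8_spectrum, string $encoding): array
{
    $result = [];
    foreach ($utf8_spectrum as $letter => $freq) {
        $byte = @iconv('UTF-8', $encoding, $letter);
        if ($byte === false || strlen($byte) !== 1) {
            continue; // the letter has no single-byte form in this encoding
        }
        $result[ord($byte)] = $freq;
    }
    return $result;
}

// Illustrative frequencies only, not the measured values from the article.
// "\u{43E}" and "\u{435}" are two lowercase Cyrillic letters (PHP 7+ escapes).
$utf8_spectrum = ["\u{43E}" => 0.09, "\u{435}" => 0.07];

foreach (['Windows-1251', 'KOI8-R', 'ISO-8859-5'] as $encoding) {
    echo $encoding, ":\n";
    print_r(spectrum_for_encoding($utf8_spectrum, $encoding));
}
```

Each encoding ends up with the same frequencies under different byte codes, which is all the detector needs.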

Next, we take a text in an unknown encoding. For every character, and for each encoding being checked, we look up that character's frequency and add it to the "rating" of that encoding. The encoding with the highest rating is most likely the encoding of the text.

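    $encodings = array(
        'cp1251'   => require 'specter_cp1251.php',
        'koi8r'    => require 'specter_koi8r.php',
        'iso88595' => require 'specter_iso88595.php',
    );

    // Start every encoding's rating at zero
    $enc_rates = array_fill_keys(array_keys($encodings), 0.0);

    for ($i = 0; $i < strlen($str); ++$i) {
        $code = ord($str[$i]);
        foreach ($encodings as $encoding => $char_specter) {
            // Bytes missing from the spectrum contribute nothing
            $enc_rates[$encoding] += isset($char_specter[$code]) ? $char_specter[$code] : 0;
        }
    }

    var_dump($enc_rates);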

Don't try to run this code at home as-is: treat it as a sketch, since I omitted details so as not to clutter the article. $char_specter is exactly those arrays linked on pastebin.
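For something that can be run without the spectrum files, here is a self-contained sketch of the same rating loop. The function name `rate_encodings` is hypothetical, and the toy spectra cover just three letters with made-up frequencies; only the byte codes are real.

```php
<?php
// Hypothetical helper wrapping the rating loop: returns the ratings,
// highest first. $spectra maps encoding name => (byte code => frequency).
function rate_encodings(string $str, array $spectra): array
{
    $rates = array_fill_keys(array_keys($spectra), 0.0);
    for ($i = 0, $n = strlen($str); $i < $n; ++$i) {
        $code = ord($str[$i]);
        foreach ($spectra as $encoding => $spectrum) {
            $rates[$encoding] += $spectrum[$code] ?? 0.0;
        }
    }
    arsort($rates);
    return $rates;
}

// Toy spectra: the same three lowercase Cyrillic letters in each encoding,
// with made-up frequencies. Real spectra would come from the pastebin arrays.
$spectra = [
    'cp1251'   => [0xEE => 0.09, 0xE5 => 0.07, 0xE0 => 0.06],
    'koi8r'    => [0xCF => 0.09, 0xC5 => 0.07, 0xC1 => 0.06],
    'iso88595' => [0xDE => 0.09, 0xD5 => 0.07, 0xD0 => 0.06],
];

$rates = rate_encodings("\xEE\xE5\xE0", $spectra); // those three letters in CP1251
$best  = array_key_first($rates);                   // PHP 7.3+
var_dump($best);
```

With such tiny spectra the wrong encodings score zero; with full spectra they score something, and only the gap between ratings matters.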

Results

Table rows are the actual text encoding; columns are the contents of the $enc_rates array.

1) $str = 'Russian text';
cp1251 | koi8r | iso88595 | actual encoding
0.441  | 0.020 | 0.085    | Windows-1251
0.049  | 0.441 | 0.166    | KOI8-R
0.133  | 0.092 | 0.441    | ISO-8859-5

All perfect. The real encoding's rating is already roughly 3 to 5 times higher than the runner-up's, and that is on such a short text. On longer texts the ratio will be about the same.

2) $str = 'STRING IN CAPS RUSSIAN TEXT';
cp1251 | koi8r | iso88595 | actual encoding
0.013  | 0.705 | 0.331    | Windows-1251
0.649  | 0.013 | 0.201    | KOI8-R
0.007  | 0.392 | 0.013    | ISO-8859-5


Oops! A complete mess. That is because uppercase letters in CP1251 usually share their codes with lowercase letters in KOI8-R, and lowercase letters are in turn used far more often than uppercase ones. So an all-caps string in CP1251 gets detected as KOI8-R.
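The overlap is easy to check with iconv. The sketch below walks over the CP1251 codes of the uppercase letters (0xC0 to 0xDF), reinterprets each byte as KOI8-R, and counts how many come out as lowercase Cyrillic letters. The helper name `is_lower_cyrillic` is my own.

```php
<?php
// Returns true if $utf8 is a single lowercase Cyrillic letter in the range
// U+0430..U+044F, checked directly on the raw UTF-8 bytes.
function is_lower_cyrillic(string $utf8): bool
{
    if (strlen($utf8) !== 2) {
        return false;
    }
    return (strcmp($utf8, "\xD0\xB0") >= 0 && strcmp($utf8, "\xD0\xBF") <= 0)
        || (strcmp($utf8, "\xD1\x80") >= 0 && strcmp($utf8, "\xD1\x8F") <= 0);
}

$lower_in_koi8r = 0;
for ($code = 0xC0; $code <= 0xDF; ++$code) { // uppercase letters in CP1251
    $as_koi8r = iconv('KOI8-R', 'UTF-8', chr($code));
    if ($as_koi8r !== false && is_lower_cyrillic($as_koi8r)) {
        ++$lower_in_koi8r;
    }
}

printf("%d of 32 CP1251 uppercase codes are lowercase letters in KOI8-R\n", $lower_in_koi8r);
```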
Let's try ignoring case (the case-insensitive "spectra"):

1) $str = 'Russian text';
cp1251 | koi8r | iso88595 | actual encoding
0.477  | 0.342 | 0.085    | Windows-1251
0.315  | 0.477 | 0.207    | KOI8-R
0.216  | 0.321 | 0.477    | ISO-8859-5


2) $str = 'STRING IN CAPS RUSSIAN TEXT';
cp1251 | koi8r | iso88595 | actual encoding
1.074  | 0.705 | 0.465    | Windows-1251
0.649  | 1.074 | 0.201    | KOI8-R
0.331  | 0.392 | 1.074    | ISO-8859-5


As you can see, the correct encoding consistently comes out on top both with the case-sensitive "spectra" (provided the string contains few capital letters) and with the case-insensitive ones. In the second case the lead is not as convincing, of course, but it is quite stable even on short strings. You can also play with the letter weights, for example by making them non-linear with respect to frequency.
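One possible way to experiment with non-linear weights is a simple power transform of the spectrum before rating. This is only a sketch: the helper name `reweight_spectrum`, the exponent, and the sample values are all made up, and I have not measured whether any particular exponent actually improves detection.

```php
<?php
// Hypothetical helper: turns frequencies into weights by raising them to a
// power. $gamma = 1.0 keeps the linear weights used in the article.
function reweight_spectrum(array $spectrum, float $gamma): array
{
    // array_map with a single array preserves the byte-code keys
    return array_map(
        function ($freq) use ($gamma) {
            return pow($freq, $gamma);
        },
        $spectrum
    );
}

$spectrum = [0xEE => 0.09, 0xF9 => 0.0036]; // made-up frequencies
$weights  = reweight_spectrum($spectrum, 0.5);
print_r($weights);
```

An exponent below 1 narrows the gap between frequent and rare letters; an exponent above 1 widens it.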

5. Conclusion


This post does not cover UTF-8. There is no fundamental difference, except that getting character codes and splitting a string into characters takes somewhat more time and effort.
These ideas are not limited to Cyrillic encodings, of course; it all comes down to having "spectra" for the languages and encodings in question.
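For the UTF-8 case mentioned above, splitting into characters can be sketched with PCRE's `/u` modifier. The helper name `utf8_char_frequencies` is hypothetical, and frequencies here are relative to the number of characters, which is my assumption.

```php
<?php
// Hypothetical helper: character frequencies for a UTF-8 string.
// Returns null when the input is not valid UTF-8.
function utf8_char_frequencies(string $text): ?array
{
    // With the /u modifier PCRE splits on code points and fails on invalid UTF-8
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return null;
    }

    $total = count($chars);
    $freqs = [];
    foreach (array_count_values($chars) as $char => $count) {
        $freqs[$char] = $count / $total;
    }
    return $freqs;
}

// Three Cyrillic letters, two of them identical, written as UTF-8 bytes
$freqs = utf8_char_frequencies("\xD0\xBE\xD0\xBE\xD0\xB5");
print_r($freqs);
```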

P.S. If there is real need or interest, I will publish a fully working library on GitHub in a second part. Although I believe the data in this post is quite enough to quickly write such a library for your own needs: the "spectrum" for Russian is published, and it can easily be carried over to every encoding you need.

UPDATE
A wonderful function surfaced in the comments, the one I linked to under the heading "squalor". Maybe I got carried away with the wording, but what is published is published; I am not in the habit of editing such things. So as not to make unfounded claims, let's see whether it works 100% of the time, as its presumed author says.
1) Will there be errors during "normal" operation of this function? Suppose our content is 100% valid.
Answer: yes, there will.
2) Will it detect anything other than UTF-8 versus non-UTF-8?
Answer: no, it will not.

Here is the code:
    // '...' stands for a short Russian string; the Cyrillic literal is not reproduced here
    $str_cp1251 = iconv('UTF-8', 'Windows-1251', '...');
    var_dump(md5($str_cp1251));
    var_dump(md5(iconv('Windows-1251', 'Windows-1251', $str_cp1251)));
    var_dump(md5(iconv('KOI8-R', 'KOI8-R', $str_cp1251)));
    var_dump(md5(iconv('ISO-8859-5', 'ISO-8859-5', $str_cp1251)));
    var_dump(md5(iconv('UTF-8', 'UTF-8', $str_cp1251)));

And the output:
    m00t@m00t:~/workspace/test$ php detect_encoding.php
    string(32) "96e14d7add82668414ffbc498fcf2a4e"
    string(32) "96e14d7add82668414ffbc498fcf2a4e"
    string(32) "96e14d7add82668414ffbc498fcf2a4e"
    string(32) "96e14d7add82668414ffbc498fcf2a4e"
    PHP Notice: iconv(): Detected an illegal character in input string in /home/m00t/workspace/test/detect_encoding.php on line 36
    PHP Stack trace:
    PHP 1. {main}() /home/m00t/workspace/test/detect_encoding.php:0
    PHP 2. iconv() /home/m00t/workspace/test/detect_encoding.php:36
    string(32) "d41d8cd98f00b204e9800998ecf8427e"

What do we see? A single-byte Cyrillic string does not change after iconv($encoding, $encoding). So this trick can only tell UTF-8 from non-UTF-8, and even that at the cost of a PHP notice.
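If telling UTF-8 from non-UTF-8 is all that is needed, it can be done without triggering any notice. A minimal sketch, assuming the PCRE extension is available (the helper name `is_valid_utf8` is my own):

```php
<?php
// Hypothetical helper: true if $str is valid UTF-8. The empty pattern with
// the /u modifier makes PCRE validate the subject; on invalid input
// preg_match() returns false instead of raising a notice.
function is_valid_utf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8("\xD0\xBE\xD0\xB5")); // two Cyrillic letters in UTF-8
var_dump(is_valid_utf8("\xEE\xE5"));         // the same letters as CP1251 bytes
```

Where the mbstring extension is installed, `mb_check_encoding($str, 'UTF-8')` answers the same question.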
IMHO, it is precisely because of code like this that PHP gets called a "language for fools" (c), as the trolls never fail to write in any thread about the language.

Source: https://habr.com/ru/post/107945/

