How to transcode latin1 to Cyrillic

I keep getting asked the same question: “How do I convert the mojibake (krakozyabry) from a database that stores latin1 strings into normal Cyrillic (windows-1251) or utf-8?”

Below I will try to answer this question as fully as I can, and also give a piece of PHP code that solves the problem unambiguously.

First, I do not recommend that anyone keep working in the windows-1251 encoding. This single-byte encoding no longer meets modern requirements. Move all your projects to utf-8 quickly. The faster this is done, the sooner your problems with mojibake (garbled characters) will end.

Now about latin1. This encoding (in MySQL, latin1 means windows-1252) was the usual choice in MySQL up to version 4. Where windows-1251 keeps its Cyrillic letters, the windows-1252 code table has accented Western European letters such as À or ÿ. But since both encodings are single-byte, there is no problem reading Cyrillic data from such a table and outputting it as windows-1251: the bytes pass through unchanged, and the codes are the same (0xA0-0xFF). All of this works, however, only until you move to a MySQL 5+ setup that talks to the client in utf-8.
What does MySQL 5+ do when it hands you such data? Before sending it to the client, it faithfully recodes all the data into utf-8, placing the accented Latin characters (as far as the server knows, your Cyrillic letters stored as latin1 are accented Latin characters) in the range of utf-8 codes where they belong. As a result, if you try to recode the resulting utf-8 string back to Cyrillic with iconv('utf-8', 'windows-1251', $str), you will fail: iconv will raise an error or return an empty string.
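
The failure can be reproduced without a database. The sketch below is only an illustration: it assumes the script file is saved as utf-8 and that the iconv extension knows the names windows-1251 and windows-1252; the sample word and variable names are made up for the example.

```php
<?php
// The Cyrillic word "Привет" as the single bytes an old application stored.
$cp1251Bytes = iconv('utf-8', 'windows-1251', 'Привет');

// What the server does: it believes the bytes are latin1 (windows-1252)
// and converts them to utf-8 before sending them to the client.
$fromServer = iconv('windows-1252', 'utf-8', $cp1251Bytes);
// $fromServer now holds "Ïðèâåò": valid utf-8, but accented Latin letters.

// The naive repair: windows-1251 has no codes for "Ï", "ð" and the rest,
// so the conversion cannot succeed. With glibc's iconv the call returns
// false and raises a notice (suppressed here with @).
$naive = @iconv('utf-8', 'windows-1251', $fromServer);
```

The important point is that $fromServer is perfectly valid utf-8; it simply encodes the wrong characters, which is why a direct conversion to windows-1251 has nothing to map them to.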

The first thing a programmer tries is to change the table encoding from latin1 to windows-1251 in phpMyAdmin. But MySQL cannot do this (so it reports), because windows-1251 has no counterparts for those accented Latin characters. The second idea is to convert the table to utf-8. That does work, only the text is still displayed as mojibake.

What to do? How do we solve this problem?

The solution is quite simple, but to arrive at it yourself you need to clearly understand what encodings are and how they work. My hand-drawn chart should help with that.

[image: the author's hand-drawn chart of the encodings]

And here is the algorithm that I use to get the encodings in order.

  1. I convert all database tables to utf-8. In the process, the supposedly Cyrillic characters stored as latin1, which MySQL therefore treats as accented Latin letters, are converted to utf-8 and take their rightful places in the utf-8 code range intended for accented Latin letters.
  2. I write a small PHP utility that does the following with each string:
    • a) Converts the string to windows-1252. There should be no problems here. The accented Latin letters thus return to the code range A0-FF.
    • b) Converts the resulting single-byte string to utf-8, treating it not as windows-1252 but as windows-1251, i.e. assigning the bytes in the range A0-FF to Cyrillic. As a result, the characters land in the utf-8 code range intended for Cyrillic.

  3. That is all: our string is now officially a Cyrillic string in utf-8. It can be written back to the same database cell or sent straight to the output stream. However, I still recommend performing a one-time conversion of the whole database and forgetting latin1 like a bad dream.
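
Steps a) and b) can be tried on a single string before touching any table. This is a minimal sketch, not a finished tool: the function name fix_latin1_cyrillic is hypothetical, and the script file is assumed to be saved as utf-8.

```php
<?php
/**
 * Hypothetical helper: repairs a utf-8 string whose Cyrillic text was
 * mis-decoded as latin1 (windows-1252). Returns null when a step fails.
 */
function fix_latin1_cyrillic(string $s): ?string
{
    // Step a: go back to the original single bytes (codes A0-FF).
    $bytes = @iconv('utf-8', 'windows-1252', $s);
    if ($bytes === false) {
        return null; // the string has characters outside windows-1252
    }
    // Step b: read the same bytes as windows-1251 and encode them as utf-8.
    $fixed = @iconv('windows-1251', 'utf-8', $bytes);
    return $fixed === false ? null : $fixed;
}

$fixed = fix_latin1_cyrillic('Ïðèâåò'); // "Привет"
```

Returning null instead of an empty string is a deliberate choice in this sketch: it makes it harder to overwrite a database cell with nothing when a conversion step fails.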


Below is sample PHP code that converts each user's full name (the fio column) into proper Cyrillic.

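// Assumes a connection already opened with mysql_connect().
// Note: the mysql_* functions were removed in PHP 7.
$q = 'select id, fio from `users`';
$res = mysql_query($q);
while (($row = mysql_fetch_assoc($res)) !== false) {
    // fio arrives as utf-8 holding latin1 characters: convert it to single-byte windows-1252
    $s = iconv('utf-8', 'windows-1252', $row['fio']);
    if ($s === false) {
        continue; // not representable in windows-1252 (e.g. already converted): skip the row
    }
    // and back to utf-8, but this time reading the bytes as windows-1251
    $s = iconv('windows-1251', 'utf-8', $s);
    // write the repaired string back to the same row
    $q = 'update `users` set fio = "'.mysql_real_escape_string($s).'" where id = '.(int)$row['id'];
    mysql_query($q);
}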
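
Because the loop writes back into the same column, it can help to check which rows still look damaged before converting them. The check below is a rough heuristic under stated assumptions, not part of the original recipe: the function name is hypothetical, and genuine Western European text (for example "café") would also match, so it only suits columns known to hold Cyrillic.

```php
<?php
/**
 * Hypothetical check: does this utf-8 string look like Cyrillic text that
 * was mis-decoded as latin1? A heuristic only, not a proof.
 */
function looks_like_latin1_mojibake(string $s): bool
{
    // Real Cyrillic letters are already present: leave the string alone.
    if (preg_match('/\p{Cyrillic}/u', $s) === 1) {
        return false;
    }
    // U+00C0-U+00FF is where the windows-1251 letters "А".."я" end up
    // after their bytes have been read as windows-1252.
    return preg_match('/[\x{00C0}-\x{00FF}]/u', $s) === 1;
}
```

A row for which this returns false can be skipped, which makes it safer to run the conversion a second time over a table that has already been partly repaired.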

Source: https://habr.com/ru/post/137061/

