Need advice on encodings

I wanted to post this in the “We write CMS” blog, but it says I don't have enough karma. So I'm posting it to my own blog and hoping it catches the eye of someone I can brainstorm with :(

Upd: I know that the right thing is to write everything in utf-8. In my personal projects, and in those I write to order from scratch, everything is in that encoding only and the problem never arises. So there's no need to post the same platitude about Unicode in the comments for the tenth time. I know. The question is about the cases where that is impossible.

Upd2: thanks for the karma, the post has been moved.
For historical reasons, my framework not only runs on systems with different encodings (utf-8, windows-1251, koi8-r), but often in mixed conditions (the database returns data in utf-8 while the client must receive windows-1251; the files are in koi8-r while the client receives utf-8; the site content is served in koi8-r while the RSS is sent in utf-8; and other such combinations).

Up to a certain point, everything was fine:

1. All text in the PHP code is in utf-8, but on load the system converts it into its internal encoding. For example:

  class ... function title() { return ec("Test"); }

where ec() is the function that performs the utf-8 -> internal_charset conversion.

2. All text operations (upper / lower / substr / etc.) are performed in the internal encoding.

3. On output, text is converted from internal_charset to output_charset.

4. When data is loaded from user files, it is converted from files_charset to internal_charset.

5. When data is loaded from the database, it is converted from db_charset to internal_charset.

6. All Smarty templates are in utf-8 and are recoded into internal_charset when they are loaded.
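The scheme above can be sketched roughly as follows. This is only an illustration: ec() is the name from the post, while charset(), recode(), from_boundary(), to_output() and the charset values are hypothetical and not taken from the real framework.

```php
<?php
// Hypothetical charset roles; the values are examples, not the framework's real config.
function charset(string $role): string
{
    static $map = [
        'source'   => 'UTF-8',        // PHP code and Smarty templates
        'internal' => 'KOI8-R',       // encoding used for all text operations
        'db'       => 'UTF-8',
        'files'    => 'KOI8-R',
        'output'   => 'WINDOWS-1251',
    ];
    return $map[$role];
}

// Convert $text between two charsets; skip the work when they match.
function recode(string $text, string $from, string $to): string
{
    if (strcasecmp($from, $to) === 0) {
        return $text;
    }
    $result = iconv($from, $to, $text);
    if ($result === false) {
        throw new RuntimeException("Cannot convert from $from to $to");
    }
    return $result;
}

// Step 1: literals in PHP code, source -> internal.
function ec(string $text): string
{
    return recode($text, charset('source'), charset('internal'));
}

// Steps 4 and 5: data entering from user files ('files') or the database ('db').
function from_boundary(string $text, string $role): string
{
    return recode($text, charset($role), charset('internal'));
}

// Step 3: internal -> output, right before the text is sent to the client.
function to_output(string $text): string
{
    return recode($text, charset('internal'), charset('output'));
}
```

The point of the layout is that every conversion happens at a boundary, so the code in between only ever sees internal_charset.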

Everything worked fine until I needed pure PHP templates. The logic itself is clear enough: the class prepares a block of data; on rendering, the system extracts it into the local scope and include()s the desired template, capturing the output, and then uses the result.
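A minimal sketch of that rendering step, assuming plain output buffering; render_template() is a hypothetical name and not the framework's actual function.

```php
<?php
// Render a PHP template file with $data unpacked into its scope and
// return the captured output instead of printing it.
function render_template(string $file, array $data): string
{
    // The closure gives the template an isolated scope: it sees only the
    // extracted variables (plus $__file and $__data).
    $render = static function (string $__file, array $__data): void {
        extract($__data, EXTR_SKIP);
        include $__file;
    };

    ob_start();
    try {
        $render($file, $data);
        return (string) ob_get_contents();
    } finally {
        // Always discard the buffer, even if the template throws.
        ob_end_clean();
    }
}
```

Note that nothing here touches encodings: the captured string is simply the template's bytes with the data's bytes pasted in.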

And this is where I hit the first snag. For simplicity, consider a specific example.

Let internal_encoding, the system encoding, be koi8-r. The PHP template, for the sake of uniformity, is in utf-8. Without any conversion you immediately get a mess: koi8-r data is inserted into the utf-8 text of the PHP template.

Then I made a decision that seemed obvious but was wrong, though I didn't see that at the time. I simply decreed that internal_encoding is always utf-8. The advantages were obvious: the ec("") calls are no longer needed, since the internal encoding always matches that of the main templates. In Smarty, in {file ...} or {include ...}, regular files can be used instead of my own xfile:// resource type [whose loader, among other things, handled the transcoding], and PHP templates are included without any fuss. And in general it's pleasant to live in a somewhat unified world :)

Is it already clear where the problem surfaced? internal_charset != PHP system locale. strtolower / strtoupper / substr don't work...
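To illustrate why the byte-oriented functions break on utf-8 data, here is a small sketch. byte_view() is a hypothetical helper, and the word is written with escapes so that the example does not depend on the encoding of the source file.

```php
<?php
// Report what the byte-oriented string functions see in a UTF-8 string.
function byte_view(string $utf8): array
{
    return [
        'strlen'     => strlen($utf8),                 // counts bytes, not characters
        'first_byte' => bin2hex(substr($utf8, 0, 1)),  // first byte of a two-byte character
    ];
}

// Four Cyrillic letters, each taking two bytes in UTF-8.
$word = "\u{0422}\u{0435}\u{0441}\u{0442}";
$view = byte_view($word);
// $view['strlen'] is 8, and substr($word, 0, 1) is the lone byte 0xD0,
// which is not a valid UTF-8 character on its own.
```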

So now I'm standing at a crossroads, and I'm asking for advice on how to sort this out :)

The first option is the head-on one, and it's how I'm partially handling things right now. Introduce the concept of a system encoding, i.e. the system locale. Replace every strtolower() with our own u_lower(), which runs iconv from the framework's internal encoding into the system encoding, then strtolower, then converts back to the internal one. Pros: the framework keeps a single unified encoding; ec() is still unnecessary; finer tuning is possible on systems with a buggy mbstring, and so on. Cons: our own functions are used instead of the standard ones, and there is extra CPU load. It's small, but what if the call sits somewhere deep inside a loop?
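A rough sketch of what such a u_lower() could look like. The name u_lower() is from the post; internal_cs() and system_cs() and their values are assumptions made for the example. It also assumes the process locale has been set (via setlocale()) to one that matches the system charset. One caveat: starting with PHP 8.2, strtolower() converts ASCII letters only, regardless of locale, so on such versions this approach no longer lowercases Cyrillic text.

```php
<?php
// Hypothetical settings for the example.
function internal_cs(): string { return 'UTF-8'; }         // framework-wide encoding
function system_cs(): string   { return 'WINDOWS-1251'; }  // assumed to match the system locale

// Lowercase a string held in the internal encoding by detouring
// through the system encoding, where strtolower() is expected to work.
function u_lower(string $text): string
{
    $sys = iconv(internal_cs(), system_cs(), $text);
    if ($sys === false) {
        return $text; // not representable in the system charset; leave untouched
    }
    $lower = strtolower($sys);
    $back = iconv(system_cs(), internal_cs(), $lower);
    return $back === false ? $text : $back;
}
```

Each call costs two iconv passes on top of the case conversion itself, which is the overhead mentioned above.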

The second option. internal_charset always equals the system locale and, in general, is not utf-8. PHP templates, like the rest of the source code, are in utf-8. When a PHP template is loaded, the data passed to it is recoded from internal to utf-8, and the captured output is then recoded from utf-8 back to internal. Pros: the system can use the standard PHP functions without overhead. Cons: any data the template reaches that was not passed to it directly has to be recoded inside the template itself (for example, I can recode $title, but $items[0]->title() will return text in the system encoding). So a function converting from the system encoding to utf-8 will be needed. That is, while the directly passed data can be printed as is:

  Hi, <?= $title ?>

the indirectly accessed data will have to be output with something like

  Buy <?= dc($items->title()) ?>

where dc() performs the internal -> utf-8 conversion. And this is also some overhead, especially if, again, it happens in a loop.
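A minimal sketch of the second option, assuming koi8-r as the internal encoding as in the example earlier. dc() is the name from the post; internal_charset_name() and render_utf8_template() are hypothetical.

```php
<?php
// Assumption for the example: the internal (system) encoding is koi8-r.
function internal_charset_name(): string { return 'KOI8-R'; }

// dc(): internal -> utf-8, for data the template pulls in by itself.
function dc(string $text): string
{
    $result = iconv(internal_charset_name(), 'UTF-8', $text);
    if ($result === false) {
        throw new RuntimeException('Cannot convert to UTF-8');
    }
    return $result;
}

// Render a utf-8 PHP template with data held in the internal encoding.
function render_utf8_template(string $file, array $data): string
{
    // Only top-level strings are recoded up front; objects stay as they are,
    // so anything they return must go through dc() inside the template.
    $data = array_map(static fn($v) => is_string($v) ? dc($v) : $v, $data);

    $render = static function (string $__file, array $__data): void {
        extract($__data, EXTR_SKIP);
        include $__file;
    };

    ob_start();
    try {
        $render($file, $data);
        $utf8 = (string) ob_get_contents();
    } finally {
        ob_end_clean();
    }

    // The captured utf-8 output goes back to the internal encoding.
    $internal = iconv('UTF-8', internal_charset_name(), $utf8);
    if ($internal === false) {
        throw new RuntimeException('Cannot convert template output to internal charset');
    }
    return $internal;
}
```

In this sketch a string passed directly costs one conversion on the way in, while every dc() call inside the template is an extra conversion per call, which is where the loop concern comes from.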

There was one more option in my head, but it has slipped my mind for now; it was absolutely crazy anyway :)

So far I'm leaning toward the second one. Unicode is Unicode, but inside a system it is better to live in the system's own encoding. If the system allows utf-8, great. If not, the choice isn't ours to make... Besides, implementing the second option means rewriting only the bare minimum of the existing code.

Maybe a fresh pair of eyes will suggest a more elegant solution?

Source: https://habr.com/ru/post/55973/
