
The speed cost of working with utf: obvious, yet little known to beginners.

Almost every article now says that only utf should be used, because it is modern, universal and generally very useful. Without denying any of that, I am puzzled by the authors who, in the same breath, worry about script speed and argue that it is better to write ++i than i++ because it is faster.

The surprise is that working with utf is slower than working with cp1251, because the data is larger and letters are no longer “aligned” to bytes: one character does not always occupy exactly one byte. This is about php / mysql.
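The size difference and the lack of “alignment” are easy to see on any Cyrillic string. Below is a minimal sketch in Python, chosen only for brevity since the article itself is about php / mysql; the sample string is made up and is not data from the tests.

```python
# Compare how many bytes the same text occupies in UTF-8 and in CP1251.
# The sample string is an arbitrary illustration, not data from the tests.
sample = "Привет, мир! Hello"

utf8_bytes = sample.encode("utf-8")
cp1251_bytes = sample.encode("cp1251")

# CP1251 is single-byte: one character is always one byte.
# UTF-8 is variable-width: Cyrillic letters take 2 bytes, ASCII takes 1.
print(len(sample), len(cp1251_bytes), len(utf8_bytes))

# Per-character widths in UTF-8, showing the lack of "alignment".
widths = [len(ch.encode("utf-8")) for ch in sample]
print(widths)
```

With mixed widths, finding the N-th character means scanning the string from the start, whereas in a single-byte encoding it is a direct byte offset.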


In fact, this is nothing terrible. Unlike blunders in the code, utf does not slow things down that much, and the slowdown is linear, so in most cases the problem is easily solved by scaling. If you have never had to squeeze money for a more powerful server out of a customer or employer, this should reassure you.
If you are not reassured, below are a few numbers that may be useful to you.
The patient: a not very powerful VDS, the only one on its node (this does not matter here), with a couple of tables of several million rows each, containing Russian text and a bit of English. The server is rebooted before each test and carries no other load. Each test was run at least 3 times; the table shows the averages.

| Data | utf result | cp1251 result | cp1251 advantage |
|---|---|---|---|
| MyISAM (text, text, int, int) | | | |
| Original database size | 1.250 GB | 0.975 GB | 1.28 times |
| Basic data | 706 MB | 479 MB | 1.47 times |
| Index data | 544 MB | 496 MB | 1.09 times |
| Query deleting part of the rows | 16 sec | 7 sec | 2.28 times |
| Dropping the fulltext index | 26 sec | 23 sec | 1.13 times |
| Building the fulltext index | 6 min 22 sec | 3 min 12 sec | 1.98 times |
| Query finding an exact match, 10 times *1 | 9.67 sec | 1.92 sec | 5.03 times |
| mysqldump export to file | 8.8 sec | 4.9 sec | 1.79 times |
| mysql import from file | 13.8 sec | 8.7 sec | 1.58 times |
| Size of the *.sql file | 773 MB | 526 MB | 1.46 times |
| Sphinx indexing | 103 sec | 41 sec | 2.51 times |
| Sphinx base size | 680 MB | 433 MB | 1.57 times |
| InnoDB (text, text, int, int) *3 | | | |
| Original database size | 925 MB | 629 MB | 1.47 times |
| Query deleting part of the rows | 21.2 sec | 12 sec | 1.76 times |
| Query finding an exact match, 10 times | 33.47 sec | 21.89 sec | 1.52 times |
| mysqldump export to file | 23 sec | 17 sec | 1.35 times |
| mysql import from file *4 | 8 min 24 sec | 5 min 41 sec | 1.47 times |
| Size of the *.sql file | 748 MB | 510 MB | 1.46 times |
| Memory (int, char(128)) *2 | | | |
| Memory table size | 515 MB | 179 MB | 2.87 times |
| Memory table row length | 390 | 133 | 2.93 times |
| Memory table, 1000 searches, something found every time | 1.9 sec | 0.32 sec | 5.93 times |
| Memory table, 1000 searches, nothing found | 1.8 sec | 0.28 sec | 6.42 times |


*1: Shocked by these numbers, I ran a similar test on localhost, where the advantage dropped to 3.02 times. Perhaps in the utf case something did not fit in the cache or spilled to disk unnecessarily, since there is more data.
*2: The memory tables are used for exact-match searches. They contain pure Russian text and some spaces, which keeps things simple and fast. Approximately 2 million rows. The memory tables in utf8 are about 3 times larger than in cp1251, because rows have a fixed size (memory tables cannot work any other way) and utf8 reserves 3 bytes per character.
*3: For InnoDB the full-text index was not tested, because InnoDB does not support it. The InnoDB tests used a table of a slightly different size than the MyISAM tests and a different VDS, so the absolute results cannot be compared directly.
*4: It is quite unclear why the import into InnoDB took so long. For MyISAM the difference between import and export is minimal.
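The roughly threefold row length of the memory table follows from simple arithmetic. The sketch below, in Python, estimates only the column payload for a table of (int, char(128)); the per-row overhead of the engine is not modelled (that is an assumption of this sketch), so the totals come out slightly below the measured 390 and 133.

```python
# Rough estimate of the fixed row payload of a MEMORY table with
# columns (int, char(128)). Per-row engine overhead is deliberately
# left out, so the totals only approximate the measured row lengths.
INT_BYTES = 4          # a MySQL INT is 4 bytes
CHAR_LENGTH = 128      # char(128), as in the test table


def row_payload(bytes_reserved_per_char):
    """Bytes reserved for one row: the int plus the fixed-width char column."""
    return INT_BYTES + CHAR_LENGTH * bytes_reserved_per_char


utf8_row = row_payload(3)    # utf8 reserves 3 bytes per character
cp1251_row = row_payload(1)  # cp1251 needs exactly 1 byte per character

print(utf8_row, cp1251_row, round(utf8_row / cp1251_row, 2))
# prints: 388 132 2.94
```

The estimated ratio of 2.94 is close to the 2.93 measured in the table, which suggests that the reserved character width accounts for nearly all of the difference.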



And a few general words. This “article” first appeared as a draft a few years ago; now only Sphinx has been added and the tests have been rerun. It grew out of a dispute on some forum about the viability of utf, where it was claimed that other encodings would die within a year. They have not.
And php / mysql still has plenty of assorted problems with it. In one place you must write utf, in another utf-8, in a third utf8. And utf can even be ru_RU.UTF or en_EN.UTF, which gives funny effects with iconv //IGNORE and //TRANSLIT, God knows why. If php is installed as a module, the locale is the same for the whole server, with all the consequences, and even with the correct locale you cannot use the normal string functions; you must use their multibyte-aware counterparts. In general, utf is certainly an advanced technology ... but it must be applied thoughtfully and without excessive fanaticism.
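Why the ordinary string functions are not enough can be shown with a small analogy. The sketch below uses Python byte strings to stand in for byte-oriented string functions; it is an illustration of the general problem with an arbitrary sample word, not a reproduction of any particular php function.

```python
# Python analogy of byte-oriented vs character-aware string handling.
text = "Привет"                 # arbitrary sample word
raw = text.encode("utf-8")      # what a byte-oriented function sees

# Length in bytes differs from length in characters.
print(len(raw), len(text))      # prints: 12 6

# Cutting at a byte offset can split a 2-byte letter in half;
# the damaged tail is decoded as the replacement character U+FFFD.
broken = raw[:3].decode("utf-8", errors="replace")
print(broken)

# In a single-byte encoding, byte offsets and character offsets coincide.
single = text.encode("cp1251")
print(single[:3].decode("cp1251"))  # prints: При
```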

PS: A note for those who like to compress traffic with proxy servers: an html file in utf8 is 5-20% larger, even when gzipped.
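A quick way to check this on your own pages is sketched below in Python. The sample page is invented and highly repetitive, so it compresses far better than real html; the script shows the method only and is not expected to reproduce the 5-20% figure.

```python
import gzip


def sizes(html, encoding):
    """Return (raw byte count, gzipped byte count) for the page in one encoding."""
    raw = html.encode(encoding)
    return len(raw), len(gzip.compress(raw))


# An invented sample page; repeated text compresses unusually well,
# so the ratio here is not representative of real pages.
paragraph = "<p>Съешь же ещё этих мягких французских булок, да выпей чаю.</p>\n"
page = "<html><body>\n" + paragraph * 50 + "</body></html>\n"

utf8_raw, utf8_gz = sizes(page, "utf-8")
cp1251_raw, cp1251_gz = sizes(page, "cp1251")

print("raw:    ", utf8_raw, cp1251_raw)
print("gzipped:", utf8_gz, cp1251_gz)
```

To test a real page, read its text into `page` instead of building the sample string.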

Source: https://habr.com/ru/post/116822/

