These days almost every article says that only utf should be used, because it is modern, universal and generally very useful. Without denying any of that, I am puzzled by authors who in the same breath worry about script speed and insist that it is better to write ++i than i++ because it is faster.
The surprise is that working with utf is slower than working with cp1251, because the data is larger and letters are not "aligned" to bytes: one character can take more than one byte. This is about php / mysql.

In fact, this is nothing particularly terrible. Unlike blunders in the code, utf does not slow things down that much, and the slowdown is linear, so in most cases the problem is easily solved by scaling. If you have never tried to squeeze money for a more powerful server out of a customer / employer, this should reassure you.
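The size difference is easy to see on a short string. The sketch below is only an illustration in Python, not part of the original php / mysql tests; the helper name `encoded_sizes` and the sample text are made up for the example.

```python
# Compare how many bytes the same text occupies in cp1251 and in UTF-8.
# Illustration only: `encoded_sizes` and `sample` are hypothetical names.

def encoded_sizes(text: str) -> dict:
    """Return the character count and the encoded byte length per encoding."""
    return {
        "chars": len(text),
        "cp1251": len(text.encode("cp1251")),  # always 1 byte per character
        "utf-8": len(text.encode("utf-8")),    # 2 bytes per Cyrillic letter, 1 per ASCII
    }

sample = "Привет, мир"  # 9 Cyrillic letters, a comma and a space
sizes = encoded_sizes(sample)
print(sizes)  # {'chars': 11, 'cp1251': 11, 'utf-8': 20}
```

For mostly Russian text the UTF-8 form approaches twice the cp1251 size, while English text and markup stay at one byte per character in both encodings.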
If you are not reassured, then below are a few numbers that may be useful to you.
The patient: a not very powerful VDS, the only one on its node (which makes things easier, though it does not matter here), with a couple of tables of several million rows each, holding Russian text and a bit of English. The server is rebooted before each test and carries no other load. Each test was run at least 3 times; the table shows the average.
Measurement | UTF result | cp1251 result | cp1251 advantage |
---|---|---|---|
MyISAM (text, text, int, int) | ***** | ***** | ***** |
Original database size | 1.250 GB | 0.975 GB | 1.28 times |
Data size | 706 MB | 479 MB | 1.47 times |
Index size | 544 MB | 496 MB | 1.09 times |
Query deleting part of the rows | 16 sec | 7 sec | 2.28 times |
Dropping the fulltext index | 26 sec | 23 sec | 1.13 times |
Building the fulltext index | 6 min 22 sec | 3 min 12 sec | 1.98 times |
Exact-match search query, 10 times (*1) | 9.67 sec | 1.92 sec | 5.03 times |
mysqldump export to file | 8.8 sec | 4.9 sec | 1.79 times |
mysql import from file | 13.8 sec | 8.7 sec | 1.58 times |
Size of the *.sql file | 773 MB | 526 MB | 1.46 times |
Sphinx indexing | 103 sec | 41 sec | 2.51 times |
Sphinx base size | 680 MB | 433 MB | 1.57 times |
InnoDB (text, text, int, int) (*3) | ***** | ***** | ***** |
Original database size | 925 MB | 629 MB | 1.47 times |
Query deleting part of the rows | 21.2 sec | 12 sec | 1.76 times |
Exact-match search query, 10 times | 33.47 sec | 21.89 sec | 1.52 times |
mysqldump export to file | 23 sec | 17 sec | 1.35 times |
mysql import from file (*4) | 8 min 24 sec | 5 min 41 sec | 1.47 times |
Size of the *.sql file | 748 MB | 510 MB | 1.46 times |
Memory (int, char(128)) (*2) | ***** | ***** | ***** |
Memory table size | 515 MB | 179 MB | 2.87 times |
Memory table row length | 390 | 133 | 2.93 times |
Memory table, 1000 searches, something found every time | 1.9 sec | 0.32 sec | 5.93 times |
Memory table, 1000 searches, nothing found | 1.8 sec | 0.28 sec | 6.42 times |
*1: Shocked by these numbers, I ran a similar test on localhost, where the advantage shrank to 3.02 times. Perhaps in the utf case something did not fit into the cache or spilled to disk unnecessarily, since there is more data.
*2: The Memory tables are used to search for exact matches; they contain pure Russian text and some spaces, which makes the search simple and fast. There are approximately 2 million rows. The utf8 Memory tables are 3 times larger than the cp1251 ones, because rows have a fixed size (Memory tables cannot do otherwise) and utf8 reserves 3 bytes per character in them.
*3: For InnoDB the full-text index was not tested, because InnoDB does not support it. The InnoDB test used a table of a slightly different size than the MyISAM one and a different VDS, so the absolute results cannot be compared directly.
*4: It is quite unclear why the import into InnoDB took so long. For MyISAM the difference between import and export is minimal.
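The row lengths in footnote 2 can be roughly reproduced by hand. The sketch below assumes a 4-byte int and a char(128) column that reserves the maximum bytes per character (3 for MySQL utf8, 1 for cp1251). The few bytes that separate the result from the reported 390 and 133 are presumably per-row overhead; that part is a guess, not something measured in the tests.

```python
# Rough arithmetic for the fixed row length of a Memory table (int, char(128)).
# Assumptions: INT takes 4 bytes; CHAR(128) reserves the maximum number of
# bytes per character for its character set (3 for utf8, 1 for cp1251).

INT_BYTES = 4
CHAR_LEN = 128

def fixed_row_payload(max_bytes_per_char: int) -> int:
    """Bytes reserved for the column data of one row, without row overhead."""
    return INT_BYTES + CHAR_LEN * max_bytes_per_char

utf8_payload = fixed_row_payload(3)
cp1251_payload = fixed_row_payload(1)

print(utf8_payload)    # 388 (the table reports a row length of 390)
print(cp1251_payload)  # 132 (the table reports a row length of 133)
print(round(utf8_payload / cp1251_payload, 2))  # 2.94
```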
And a few general words. This "article" first appeared in draft form a few years ago; now only Sphinx has been added and the tests have been repeated. It grew out of a forum dispute about the viability of utf, where it was claimed that other encodings would die within a year. They have not died.
And php / mysql, for example, still has plenty of assorted problems. In one place you have to write utf, in another utf-8, in another utf8. On top of that, utf can be both ru_RU.UTF and en_EN.UTF, which gives funny effects with iconv //IGNORE//TRANSLIT, God knows why. If php is installed as a module, the locale is shared by the whole server, with all that follows, and even with the correct locale you cannot use the normal string functions; you have to use their multibyte-aware counterparts. In general, utf is certainly an advanced technology ... but it must be applied thoughtfully and without excessive fanaticism.
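Why byte-oriented string functions are unsafe on utf text can be shown with a simple truncation. This is a Python stand-in for what a byte-based function does, not the php functions themselves; `truncate_bytes` is a hypothetical helper.

```python
# Cutting a string at a byte offset: harmless in cp1251, lossy in UTF-8.
# `truncate_bytes` is a hypothetical helper imitating a byte-oriented function.

def truncate_bytes(text: str, encoding: str, limit: int) -> str:
    """Encode the text, keep the first `limit` bytes, decode what is left."""
    raw = text.encode(encoding)[:limit]
    # errors="replace" turns a broken trailing character into U+FFFD
    return raw.decode(encoding, errors="replace")

word = "Привет"
print(truncate_bytes(word, "cp1251", 5))  # Приве
print(truncate_bytes(word, "utf-8", 5))   # Пр followed by the U+FFFD replacement character
```

In cp1251 five bytes are exactly five letters. In UTF-8 the same cut lands in the middle of the third letter, so only two letters survive and the rest becomes garbage.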
PS: A note for those who like to compress traffic on proxy servers: an html file in utf8 is 5-20% larger, even when gzipped.
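Anyone who wants to check this on their own pages can compare raw and gzipped sizes directly. The following is a minimal sketch: the helper `raw_and_gzip_size` and the sample page are invented for the example, and a repetitive sample like this one compresses unrealistically well, so the numbers it prints say nothing about the 5-20% figure above. Feed it real html to get a meaningful comparison.

```python
import gzip

# Compare raw and gzipped size of the same html in two encodings.
# `raw_and_gzip_size` and `page` are hypothetical; use real pages for real numbers.

def raw_and_gzip_size(html: str, encoding: str) -> tuple:
    """Return (raw byte length, gzipped byte length) for the given encoding."""
    raw = html.encode(encoding)
    return len(raw), len(gzip.compress(raw))

page = (
    "<html><body>"
    + "<p>Пример русского текста with a bit of English.</p>" * 200
    + "</body></html>"
)

for enc in ("cp1251", "utf-8"):
    raw_size, gz_size = raw_and_gzip_size(page, enc)
    print(enc, raw_size, gz_size)
```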