The speed of working with UTF: obvious, but little known to beginners

Almost every article nowadays says that only UTF should be used, because it is modern, universal and generally very useful. Without denying any of this, I am puzzled by authors who in the same breath lecture about script speed and insist that ++i is better than i++ because it is faster.

The surprise is that working with UTF is slower than working with cp1251, because the data is larger and characters are not aligned to bytes: one character may take more than one byte. This is about PHP and MySQL.
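To make the "larger and not aligned" point concrete, here is a minimal sketch in Python (not the PHP/MySQL setup measured in this article). The sample string and the helper name `utf8_byte_offset` are made up for the illustration. In cp1251 the character at index N sits at byte N; in UTF-8 you have to scan from the start to find it.

```python
text = "Привет, мир"  # 11 characters: 9 Cyrillic letters, a comma and a space

utf8_bytes = text.encode("utf-8")
cp1251_bytes = text.encode("cp1251")

print(len(text))          # 11 characters
print(len(cp1251_bytes))  # 11 bytes: one byte per character
print(len(utf8_bytes))    # 20 bytes: two bytes per Cyrillic letter


def utf8_byte_offset(data: bytes, char_index: int) -> int:
    """Find where the char_index-th character starts by scanning from byte 0.

    Hypothetical helper for illustration: a byte of the form 10xxxxxx is a
    continuation byte, any other byte starts a new character.
    """
    seen = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError(char_index)


# The letter "м" is character number 8 (counting from 0).
print(cp1251_bytes[8:9].decode("cp1251"))  # direct indexing works in cp1251
print(utf8_byte_offset(utf8_bytes, 8))     # 14, found only after a scan
```

With a single-byte encoding, character offsets and byte offsets coincide, so lookups and comparisons touch less data and need no scanning.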

In fact, this is nothing particularly terrible. Unlike blunders in the code, UTF does not slow things down that much, and the slowdown is linear, so in most cases the problem is easily solved by scaling. If you have never had to squeeze money for a more powerful server out of a customer or employer, this should reassure you.
If you are not reassured, here are a few numbers that may be useful.
The patient: a not very powerful VDS, the only one on its node (which makes it easier to test on, though that does not matter here), a couple of tables with several million rows, Russian text with a bit of English. The server is rebooted before each test and carries no other load. Each test is run at least 3 times, and the table shows the average.

Measurement	UTF	cp1251	cp1251 advantage
MyISAM (text, text, int, int)	*****	*****	*****
Initial database size	1.250 GB	0.975 GB	1.28 times
Data size	706 MB	479 MB	1.47 times
Index size	544 MB	496 MB	1.09 times
Query deleting part of the rows	16 sec	7 sec	2.28 times
Dropping the fulltext index	26 sec	23 sec	1.13 times
Building the fulltext index	6 min 22 sec	3 min 12 sec	1.98 times
Exact-match search query, run 10 times *1	9.67 sec	1.92 sec	5.03 times
mysqldump export to file	8.8 sec	4.9 sec	1.79 times
mysql import from file	13.8 sec	8.7 sec	1.58 times
Size of the *.sql file	773 MB	526 MB	1.46 times
Sphinx indexing	103 sec	41 sec	2.51 times
Sphinx index size	680 MB	433 MB	1.57 times
InnoDB (text, text, int, int) *3	*****	*****	*****
Initial database size	925 MB	629 MB	1.47 times
Query deleting part of the rows	21.2 sec	12 sec	1.76 times
Exact-match search query, run 10 times	33.47 sec	21.89 sec	1.52 times
mysqldump export to file	23 sec	17 sec	1.35 times
mysql import from file *4	8 min 24 sec	5 min 41 sec	1.47 times
Size of the *.sql file	748 MB	510 MB	1.46 times
Memory (int, char(128)) *2	*****	*****	*****
Memory table size	515 MB	179 MB	2.87 times
Memory table row length	390	133	2.93 times
Memory table, 1000 searches, something found every time	1.9 sec	0.32 sec	5.93 times
Memory table, 1000 searches, nothing found	1.8 sec	0.28 sec	6.42 times

*1: Shocked by these numbers, I ran a similar test on localhost, where the advantage dropped to 3.02 times. Perhaps in the UTF case, where there is more data, something did not fit into the cache or spilled to disk unnecessarily.
*2: The Memory tables are used for exact-match searches; they contain pure Russian text with a few spaces, which keeps things simple and fast. About 2 million rows. A Memory table in utf8 is about 3 times larger than in cp1251, because rows have a fixed size (the Memory engine offers no other option) and utf8 reserves 3 bytes per character.
*3: For InnoDB the full-text index was not tested, because InnoDB did not support it. The InnoDB tests used a table of a slightly different size than the MyISAM ones, and a different VDS, so the absolute results cannot be compared directly.
*4: It is unclear why the InnoDB import took so long. For MyISAM the difference between import and export is minimal.
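The row lengths reported for the Memory tables (390 against 133) can be roughly reproduced with simple arithmetic. The sketch below is in Python and counts only the column payload for the int, char(128) layout from the table. Per-row engine overhead is deliberately not modeled, so the results land a few bytes short of the measured values. The function name `fixed_row_payload` is invented for the illustration.

```python
CHAR_COLUMN_LENGTH = 128  # char(128), as in the Memory test table
INT_BYTES = 4             # a MySQL INT column takes 4 bytes

# Bytes reserved per character in a fixed-size row.
BYTES_PER_CHAR = {"utf8": 3, "cp1251": 1}


def fixed_row_payload(charset: str) -> int:
    """Column payload of one (int, char(128)) row; engine overhead is ignored."""
    return INT_BYTES + CHAR_COLUMN_LENGTH * BYTES_PER_CHAR[charset]


utf8_row = fixed_row_payload("utf8")
cp1251_row = fixed_row_payload("cp1251")

print(utf8_row)                         # 388
print(cp1251_row)                       # 132
print(round(utf8_row / cp1251_row, 2))  # 2.94
```

The estimated ratio of about 2.94 is close to the measured 2.93, which fits the explanation that the difference comes almost entirely from reserving 3 bytes per character.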

And a few general words. This "article" first appeared in draft form a few years ago; now only Sphinx has been added and the tests rerun. It grew out of a forum dispute about the viability of UTF, with claims that other encodings would die within a year. They have not died.
And PHP / MySQL still have plenty of problems of their own. In one place you must write utf, in another utf-8, in a third utf8. And UTF itself can be ru_RU.UTF or en_EN.UTF, which gives funny effects with iconv //IGNORE//TRANSLIT, God knows why. If PHP is installed as a module, the locale is the same for the whole server, with all that follows, and even with the correct locale you cannot use the ordinary string functions and must use their multibyte-aware counterparts. In general, UTF is certainly an advanced technology... but it must be applied thoughtfully and without excessive fanaticism.
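The point about ordinary string functions can be shown with a small sketch. In PHP, `strlen` counts bytes while `mb_strlen` counts characters; the Python code below imitates that difference by treating `bytes` as the "ordinary" byte-oriented view and `str` as the multibyte-aware one. The helper `truncate_chars` is a made-up name for the illustration.

```python
text = "Привет"
raw = text.encode("utf-8")  # what a byte-oriented function actually sees

print(len(text))  # 6 characters
print(len(raw))   # 12 bytes

# Cutting by bytes can split a character in half: the first 3 bytes hold
# the whole letter "П" plus only the first byte of "р".
broken = raw[:3].decode("utf-8", errors="replace")
print(broken)     # "П" followed by the replacement character U+FFFD


def truncate_chars(data: bytes, max_chars: int) -> bytes:
    """Cut UTF-8 data by characters instead of bytes (illustrative helper)."""
    return data.decode("utf-8")[:max_chars].encode("utf-8")


print(truncate_chars(raw, 3).decode("utf-8"))  # При
```

In cp1251 both views coincide, which is why byte-oriented functions "just work" there and silently misbehave on UTF-8 data.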

PS: A note for those who like to compress traffic on proxy servers: an HTML file in utf8 is 5-20% larger, even when gzipped.
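To check the gzip claim on your own pages, something like the following Python sketch could be used. It is not how the figures above were obtained; the function name `compressed_sizes` and the tiny sample string are assumptions for the illustration, and a realistic measurement would use the full HTML of a real page. The text must contain only characters that cp1251 can represent.

```python
import gzip


def compressed_sizes(html: str) -> dict:
    """Return raw and gzipped byte sizes of the same HTML in both encodings."""
    sizes = {}
    for charset in ("utf-8", "cp1251"):
        raw = html.encode(charset)
        sizes[charset] = {
            "raw": len(raw),
            "gzip": len(gzip.compress(raw, compresslevel=9)),
        }
    return sizes


# Hypothetical sample; replace with the contents of a real page.
sample = (
    "<html><body><p>Скорость работы с utf очевидна, "
    "но малоизвестна для начинающих.</p></body></html>"
)

for charset, size in compressed_sizes(sample).items():
    print(charset, size["raw"], size["gzip"])
```

The raw UTF-8 size is always larger for Russian text; how much of that difference survives compression depends on the page, so it is worth measuring rather than assuming.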

Source: https://habr.com/ru/post/116822/
