Of course, PHP has a perfectly good json_encode function. But up to and including version 5.3, it encodes non-ASCII characters (Cyrillic, for instance) as \uXXXX sequences, which take several times more space than UTF-8. To cut down on traffic, the conversion of UTF-8 characters into \u-sequences has to be undone. Yes, in PHP 5.4 json_encode finally got the JSON_UNESCAPED_UNICODE option, but many hosting providers still offer users only a choice between versions 5.2 and 5.3.
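The size difference is easy to measure. This is a sketch in Python rather than PHP (json.dumps with ensure_ascii=True behaves like json_encode on PHP 5.3, and ensure_ascii=False like JSON_UNESCAPED_UNICODE); the string used is just an arbitrary example:

```python
import json

s = "Привет"  # six Cyrillic characters

escaped = json.dumps(s)                      # "\u041f\u0440..." style, like PHP <= 5.3
raw     = json.dumps(s, ensure_ascii=False)  # like JSON_UNESCAPED_UNICODE

print(len(escaped.encode("utf-8")))  # 38 bytes: 6 * 6 bytes per \uXXXX, plus quotes
print(len(raw.encode("utf-8")))      # 14 bytes: 6 * 2 bytes in UTF-8, plus quotes
```

Each escaped character costs six ASCII bytes instead of the two UTF-8 bytes a Cyrillic letter actually needs, i.e. a threefold blow-up on Cyrillic text.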
I had no desire to reinvent the wheel, but the solutions I came across share a common flaw: they only handle characters of the Basic Multilingual Plane correctly.
The method, widely used on the Internet in various modifications, is to post-process the result of json_encode with a filter that replaces every \uXXXX sequence with the corresponding UTF-8 character. For example:
class Json {
    static function json_encode($data) {
        return preg_replace_callback(
            '/\\\\u([0-9a-f]{4})/i',
            function ($val) {
                // Turn the hex code into a numeric entity and decode it to UTF-8.
                return mb_decode_numericentity(
                    '&#' . intval($val[1], 16) . ';',
                    array(0, 0xffff, 0, 0xffff),
                    'utf-8'
                );
            },
            json_encode($data)
        );
    }
}
And this code worked, until I needed to add support for Unicode emoji (introduced in Unicode 6.0), most of which have code points above 0x1F000, outside the Basic Multilingual Plane.
The fact is that \u-sequences are UTF-16: one 16-bit unit (2 bytes) per character for code points 0x0000 through 0xFFFF (excluding the surrogate "window" 0xD800-0xDFFF), and two units (4 bytes), each drawn from the 0xD800-0xDFFF range, for code points above 0xFFFF.
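The surrogate-pair arithmetic can be spelled out as follows. This is a minimal sketch in Python (the function names are mine); the combining formula in from_surrogate_pair is exactly the one used in the PHP decoder further down:

```python
def to_surrogate_pair(cp):
    # Split a code point above 0xFFFF into a UTF-16 surrogate pair.
    v = cp - 0x10000                 # 20 significant bits remain
    high = 0xD800 | (v >> 10)        # top 10 bits -> 0xD800-0xDBFF
    low  = 0xDC00 | (v & 0x3FF)      # bottom 10 bits -> 0xDC00-0xDFFF
    return high, low

def from_surrogate_pair(high, low):
    # Recombine the two 10-bit halves into the original code point.
    return ((high & 0x3FF) << 10) + (low & 0x3FF) + 0x10000

print(hex(to_surrogate_pair(0x1F601)[0]),  # 0xd83d
      hex(to_surrogate_pair(0x1F601)[1]))  # 0xde01
```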
For example, the character with code 0x1F601, whose UTF-8 representation is "\xf0\x9f\x98\x81", is turned by json_encode into the string "\ud83d\ude01", and the function above converts that into "\xed\xa0\xbd\xed\xb8\x81". Instead of one 4-byte character we get two 3-byte sequences.
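This broken output is easy to reproduce. A sketch in Python: encoding each surrogate on its own as a 3-byte UTF-8-style sequence (which requires the "surrogatepass" error handler, since it is not valid UTF-8) yields exactly the bytes above:

```python
# Each surrogate encoded independently as a 3-byte sequence
# (the so-called CESU-8 form) instead of one 4-byte code point.
naive = ("\ud83d".encode("utf-8", "surrogatepass")
         + "\ude01".encode("utf-8", "surrogatepass"))

# The correct UTF-8 encoding of U+1F601.
correct = "\U0001F601".encode("utf-8")

print(naive)    # b'\xed\xa0\xbd\xed\xb8\x81'
print(correct)  # b'\xf0\x9f\x98\x81'
```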
Thus, to process such characters correctly, we have to examine the codes and convert two-unit \u-sequences separately. For example:
class Json {
    static public $_code;

    static public function json_encode($data) {
        Json::$_code = 0;
        return preg_replace_callback(
            '/\\\\u([0-9a-f]{4})/i',
            function ($val) {
                $val = hexdec($val[1]);
                if (Json::$_code) {
                    // Second half of a surrogate pair: combine with the saved first half.
                    $val = ((Json::$_code & 0x3FF) << 10) + ($val & 0x3FF) + 0x10000;
                    Json::$_code = 0;
                } elseif ($val >= 0xD800 && $val < 0xE000) {
                    // First half of a surrogate pair: remember it, emit nothing yet.
                    Json::$_code = $val;
                    return '';
                }
                return html_entity_decode(sprintf('&#x%x;', $val), ENT_NOQUOTES, 'utf-8');
            },
            json_encode($data)
        );
    }
}
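The same stateful-callback logic can be sketched in runnable Python (function and variable names are mine). Like the PHP original, it assumes well-formed input from the encoder, where surrogates always come in high/low pairs:

```python
import re

_pending = [0]  # an unconsumed first surrogate, like Json::$_code

def _decode(match):
    val = int(match.group(1), 16)
    if _pending[0]:
        # Second half of the pair: combine and reset the state.
        val = ((_pending[0] & 0x3FF) << 10) + (val & 0x3FF) + 0x10000
        _pending[0] = 0
    elif 0xD800 <= val < 0xE000:
        # First half of the pair: remember it, emit nothing yet.
        _pending[0] = val
        return ""
    return chr(val)

def unescape_unicode(s):
    _pending[0] = 0
    return re.sub(r"\\u([0-9a-f]{4})", _decode, s, flags=re.I)

print(unescape_unicode(r'"\ud83d\ude01"'))  # "😁"
```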
This version correctly converts any \u-sequence, surrogate pairs included, back into UTF-8.
P.S. I am well aware that the code above is far from optimal, but it works, and fast enough for my purposes; benchmarking every conceivable variant is more effort than I care to spend. Here, for example, is a variant that shifts the surrogate analysis into the regular expression itself:
class Json {
    static public function json_encode($data) {
        return preg_replace_callback(
            // First alternative: a surrogate pair \ud8xx-\udbxx followed by
            // \udcxx-\udfxx; second alternative: a single BMP escape.
            '/\\\\ud([89ab][0-9a-f]{2})\\\\ud([c-f][0-9a-f]{2})|\\\\u([0-9a-f]{4})/i',
            function ($val) {
                return html_entity_decode(
                    empty($val[3])
                        ? sprintf('&#x%x;',
                            ((hexdec($val[1]) & 0x3FF) << 10)
                            + (hexdec($val[2]) & 0x3FF) + 0x10000)
                        : '&#x' . $val[3] . ';',
                    ENT_NOQUOTES,
                    'utf-8'
                );
            },
            json_encode($data)
        );
    }
}
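For reference, here is the same pair-matching regex approach as a runnable Python sketch (the function name is mine). Note that the low-surrogate class must be [c-f], since low surrogates span \udc00 through \udfff:

```python
import re

PAIR_OR_SINGLE = re.compile(
    r"\\ud([89ab][0-9a-f]{2})\\ud([c-f][0-9a-f]{2})"  # surrogate pair
    r"|\\u([0-9a-f]{4})",                             # single BMP escape
    re.I,
)

def _decode(m):
    if m.group(3) is None:
        # Pair branch: recombine the two 10-bit halves.
        return chr(((int(m.group(1), 16) & 0x3FF) << 10)
                   + (int(m.group(2), 16) & 0x3FF) + 0x10000)
    return chr(int(m.group(3), 16))

def json_unescape(s):
    return PAIR_OR_SINGLE.sub(_decode, s)

print(json_unescape(r'"\ud83d\ude01 \u0434\u0430"'))  # "😁 да"
```

Because alternation tries the pair branch first at every position, a high surrogate followed by a low surrogate is always consumed as one match and never falls through to the single-escape branch.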
P.P.S. The html_entity_decode calls sit inside the callback, applied only to the numeric entity just built, because the data being processed may itself contain HTML with service entities (for '<', '>', '&', etc.) that must not be converted into characters.