
PHP tools for Japanese

After a year working as a systems engineer at a Japanese company, I have collected a few points that any programmer working with Japanese text should know.

Japanese uses three writing systems, so working with text is not as simple as in European languages. Let's start with the simplest issues and move on to the more complex ones.

As in Russia, Japan still has sites that use a legacy local encoding, so the first step is to convert the text to UTF-8:

$output = iconv('Shift-JIS', 'UTF-8//IGNORE', $input); 
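Not every input is actually in the legacy encoding, so it can help to check before converting. Below is a minimal sketch using the mbstring extension instead of iconv. The function name `toUtf8` and the default source encoding `SJIS` are assumptions for illustration, and the UTF-8 check is only a heuristic.

```php
<?php
// Hypothetical helper: convert legacy-encoded text to UTF-8,
// leaving input that is already valid UTF-8 untouched.
function toUtf8(string $input, string $legacyEncoding = 'SJIS'): string
{
    // Heuristic: Shift-JIS bytes for Japanese text are almost never valid UTF-8.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $legacyEncoding);
}

// Build a Shift-JIS sample from a UTF-8 literal, then convert it back.
$original = '日本語のテキスト';
$sjis = mb_convert_encoding($original, 'SJIS', 'UTF-8');

var_dump(toUtf8($sjis) === $original);     // legacy input is converted
var_dump(toUtf8($original) === $original); // UTF-8 input passes through
```

Pure ASCII input is valid in both encodings, so it passes through unchanged either way.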

Japanese text has a peculiarity: characters can be written in two widths, so-called full-width (zen-kaku) and half-width (han-kaku). Japanese text is zen-kaku and European text is han-kaku, but in principle any European text can also be written as zen-kaku, for example ＳＡＬＥ. The Japanese are especially fond of zen-kaku digits: １２３.
To convert ＳＡＬＥ to SALE:

 public function toHanKaku($text)
 {
     return mb_convert_kana($text, 'a');
 }
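Width differences also affect spaces and katakana, so a broader normalization step is often useful before searching or comparing text. The sketch below combines several `mb_convert_kana` options. The function name `normalizeWidth` and the choice of options are assumptions for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: bring text to one consistent width.
function normalizeWidth(string $text): string
{
    // a: full-width letters and digits -> half-width
    // s: full-width space -> half-width space
    // K: half-width katakana -> full-width katakana
    // V: merge a kana and its following voiced mark into one character (used with K)
    return mb_convert_kana($text, 'asKV', 'UTF-8');
}

echo normalizeWidth('ＳＡＬＥ　１２３'), PHP_EOL; // SALE 123
echo normalizeWidth('ｶﾞｲﾄﾞ'), PHP_EOL;          // ガイド
```

Passing the encoding explicitly avoids depending on the configured internal encoding.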

Why does Japanese have three writing systems? The first, kanji (漢字), is used to write the logographic characters. The second, hiragana (ひらがな), can be considered the main syllabary, because any kanji can be spelled out in hiragana. The third, katakana (カタカナ), covers the same sounds as hiragana but is used to write foreign words.

To convert hiragana to katakana (はずれ to ハズレ, the same word in different scripts), we use the same function with a different option:

 public function toKatakana($text)
 {
     return mb_convert_kana($text, 'C');
 }
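The reverse option, lowercase `c`, converts katakana to hiragana, which gives a simple way to compare words regardless of which kana script they were typed in. This is a minimal sketch; the function name `sameKanaWord` is an assumption for illustration.

```php
<?php
// Hypothetical helper: compare two strings while ignoring the
// hiragana/katakana distinction.
function sameKanaWord(string $a, string $b): bool
{
    // c: full-width katakana -> full-width hiragana (hiragana is left as-is)
    $normalizedA = mb_convert_kana($a, 'c', 'UTF-8');
    $normalizedB = mb_convert_kana($b, 'c', 'UTF-8');
    return $normalizedA === $normalizedB;
}

var_dump(sameKanaWord('はずれ', 'ハズレ')); // bool(true)
var_dump(sameKanaWord('はずれ', 'あたり')); // bool(false)
```

Kanji are not touched by this conversion, so a word written in kanji will not match its kana spelling.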

Regular expressions run into the same full-width problem. Take a date written in zen-kaku digits, in the format ２０１６.２０.２０. For it to be recognized, the "u" (Unicode) modifier is mandatory:

 preg_match('/[\d]{4}.[\d]{2}.[\d]{2}/u', $text, $match); 
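A small sketch of why the modifier matters, assuming the date is written in zen-kaku digits. With "u", PHP treats the subject as UTF-8 and `\d` also matches Unicode digits; without it, the pattern sees only raw bytes. The sample string and variable names are made up for illustration.

```php
<?php
// Sample text with a date in full-width digits (illustrative input).
$text = '発売日：２０１６.１２.２０';

// Without "u": \d only matches ASCII digits, so nothing is found.
$foundWithoutU = preg_match('/\d{4}.\d{2}.\d{2}/', $text);

// With "u": \d matches the full-width digits as well.
$foundWithU = preg_match('/\d{4}.\d{2}.\d{2}/u', $text, $match);

// Convert the matched digits to half-width for further processing.
// n: full-width digits -> half-width digits
$normalized = $foundWithU ? mb_convert_kana($match[0], 'n', 'UTF-8') : null;

var_dump($foundWithoutU, $foundWithU, $normalized);
```

After normalization the value is plain ASCII and can be split or validated with ordinary string functions.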

Like everyone else, the Japanese do not fuss over page addresses and put Japanese characters straight into the URL, for example:

 <a href="http://ja.wikipedia.org/wiki/ヤフー">ヤフー</a> 

The address is correct, but for a client such as cURL to understand it, the non-ASCII part must be percent-encoded:

 public function japaneseUrlEncode($text)
 {
     // Percent-encode every run of bytes outside printable ASCII (0x21-0x7E).
     return preg_replace_callback(
         '/[^\x21-\x7e]+/',
         function ($matches) {
             return urlencode($matches[0]);
         },
         $text
     );
 }
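For reference, here is a standalone sketch of the same idea applied to the Wikipedia link above. It swaps `urlencode` for `rawurlencode`, so that a space becomes `%20` rather than `+`; that swap and the function name `encodeNonAsciiUrl` are assumptions of this example, not the original method.

```php
<?php
// Hypothetical standalone variant: percent-encode only the bytes that are
// not printable ASCII, leaving the scheme, host and slashes intact.
function encodeNonAsciiUrl(string $url): string
{
    return preg_replace_callback(
        '/[^\x21-\x7e]+/',
        function (array $matches): string {
            // rawurlencode encodes a space as %20 (urlencode would emit "+").
            return rawurlencode($matches[0]);
        },
        $url
    );
}

echo encodeNonAsciiUrl('http://ja.wikipedia.org/wiki/ヤフー'), PHP_EOL;
// http://ja.wikipedia.org/wiki/%E3%83%A4%E3%83%95%E3%83%BC
```

Characters that are already ASCII, including an existing `%`, are left alone, so an already-encoded URL is not encoded twice.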

The Japanese text input system is also special: it lets you type various decorations straight from the keyboard, with no additional tools: ♪ ♫ ☆ ★ 「」 ♡. Blogs, forums, Twitter and similar places are literally overflowing with them.

The problem is that when you try to insert an emoji into a regular utf8_general_ci table, you get truncated text and an error:

 Warning: #1366 Incorrect string value: '\xF0\x9F\x98\x8ASI...' for column 'content' at row 1 

The point is that utf8_general_ci belongs to MySQL's utf8 character set, which covers a very wide range of characters but is limited to 3 bytes per character and therefore cannot store 4-byte emoji. The encoding that fits is utf8mb4.
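Switching to utf8mb4 usually involves two things: converting the table and making the connection itself use utf8mb4. The sketch below only builds the two strings involved. The helper names, the host `localhost`, the database `blog` and the table `posts` are all hypothetical, and utf8mb4_unicode_ci is just one possible collation.

```php
<?php
// Hypothetical helper: PDO DSN that makes the connection use utf8mb4.
function utf8mb4Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical helper: statement that converts an existing table to utf8mb4.
function utf8mb4AlterSql(string $table): string
{
    // Quote the identifier with backticks, doubling any backtick inside it.
    $quoted = '`' . str_replace('`', '``', $table) . '`';
    return "ALTER TABLE {$quoted} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
}

echo utf8mb4Dsn('localhost', 'blog'), PHP_EOL;
echo utf8mb4AlterSql('posts'), PHP_EOL;
```

The DSN would be passed to `new PDO(...)` and the statement run once per table; try this on a copy of the data first, since converting a large table can take a while.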

An example of such decorated Japanese text:
はなさまこんんにちは★SIENAスんばパースス店す♡
日は入荷したてのnewピアスのご紹介です♫


Not everyone can change the encoding, though, so there is another solution: delete the 4-byte emoji.

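 public function clear4byte($string)
 {
     // Remove every 4-byte UTF-8 sequence (code points U+10000 to U+10FFFF).
     return preg_replace(
         '%(?:
             \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
           | [\xF1-\xF3][\x80-\xBF]{3}        # planes 4-15
           | \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
         )%xs',
         '',
         $string
     );
 }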
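When the input is known to be valid UTF-8, the same filter can be written with a code-point range and the "u" modifier. This is an alternative sketch, not the original method; the names `has4ByteChars` and `strip4ByteChars` are assumptions for illustration.

```php
<?php
// Hypothetical helper: true if the string contains a character that needs
// 4 bytes in UTF-8 (any code point above U+FFFF).
function has4ByteChars(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

// Hypothetical helper: remove such characters.
function strip4ByteChars(string $text): string
{
    // preg_replace returns null on invalid UTF-8; in that case the input
    // is returned unchanged (an assumption of this sketch).
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text) ?? $text;
}

$sample = "こんにちは\u{1F60A}SIENA★";

var_dump(has4ByteChars($sample));                  // bool(true)
var_dump(strip4ByteChars($sample));                // the emoji is removed, ★ is kept
var_dump(has4ByteChars(strip4ByteChars($sample))); // bool(false)
```

Three-byte decorations such as ★ and ♡ survive, because they fit into the 3-byte utf8 character set.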

I hope this material will be useful to those who work with Japanese text, and to everyone else as general background.

P.S. The first attempt at publishing failed. The moderator's comment:
It feels as if the article ends somewhere in the middle. Complete the material and publish it again.

It was a case of deja vu: exactly the situation described in the last example happened on Habr itself. For clarity I had added a four-byte character, which caused an error, and the text was saved to the database in truncated form.

Source: https://habr.com/ru/post/304354/

