
Localization of applications for the Chinese market, part 2. Block lists

In the previous article I gave only a brief overview of the requirements for a developer who wants to release an application on the PRC market. Of all the questions I received, the two main ones concerned block lists and getting money out of China. Here I want to look at the first of these in more detail.

禁 - means "prohibited"


Naturally, the study of such questions should start with the regulatory framework. One of the main documents on this subject is 文化部令第49号 - 网络游戏管理办法 (Decree No. 49 of the Ministry of Culture of the People's Republic of China, “Temporary measures for regulating and managing online games”). Article 9 lists the content that is unacceptable in online games:
“Article 9. Online games must not contain the following content:
1) violating the basic principles of the Constitution of China;
2) creating a threat to the national unity, sovereignty and territorial integrity of the PRC;
3) disclosing state secrets, endangering national security or harming national honor and interests;
4) inciting ethnic hatred or ethnic discrimination, undermining national unity, or infringing national customs and habits;
5) promoting cults and superstitions;
6) spreading rumors that disrupt public order or undermine social stability;
7) disseminating obscenity, pornography, gambling or violence, or abetting offenses;
8) insulting or slandering others, or infringing upon their legitimate rights and interests;
9) contradicting public morality;
10) other content prohibited by laws and by administrative and government regulations.”
All of this is more or less sensible and reasonable. We will not discuss whether such restrictions are justified; instead, let's talk about how to approach the technical implementation.
A small remark first: if you have a large project that is expected to bring a big return, the text below is not for you. You are better off buying an enterprise solution in which all of this was implemented long ago at the highest level, the databases are updated daily, and technical support is good and loyal. Such products are offered, for example, by 古尼: their “Public Mood Control System” is really smart and includes everything possible, plus two impossible functions and one incredible one. True, this solution will cost you a one-time 150,000 yuan plus 20,000 yuan annually. There are other firms, but the price range is about the same.
If you have a small project, the following text is for you.
According to Chinese law, all user-generated content must be censored. A user can create the following content:
1) text (nicknames, messages to other players)
2) graphic (avatars)
3) video

So, point one - censoring text content.
There is nothing complicated about preventing certain words and phrases from being typed (or saved). But only at first glance. You, like 100% of foreigners, have no idea which words and phrases should be blocked. That part is not hard either: such tables are easy to find on Chinese developer forums. Search for 敏感词库 (a database of sensitive words and expressions), download it in XLS or XML format, and load it into any block list.
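A minimal sketch of such a check, assuming the downloaded table has already been exported to plain lines of text. The names `load_blocklist` and `find_blocked` and the two sample entries are made up for illustration and do not come from any particular 敏感词库.

```python
def load_blocklist(lines):
    """Build a set of blocked phrases from an iterable of text lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_blocked(text, blocklist):
    """Return the blocked phrases that occur as substrings of text."""
    # Chinese has no spaces between words, so match substrings, not tokens.
    # Whitespace is removed first so that "政 府" does not slip through.
    compact = "".join(text.split()).lower()
    return sorted(phrase for phrase in blocklist if phrase in compact)


# Hypothetical two-entry list; a real one has thousands of lines.
blocklist = load_blocklist(["政府无能", "  赌博  "])
print(find_blocked("我觉得 政府 无能", blocklist))
```

With thousands of entries a linear scan like this is still usually fast enough for nicknames and chat messages; a multi-pattern matcher such as Aho-Corasick is the usual next step if it is not.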
Difficulties begin later.
Firstly, any resident of the PRC understands and can read both traditional and simplified hieroglyphs. But in Unicode the traditional and simplified forms of the same hieroglyph are two different characters. Accordingly, you need to either inflate the database two or three times over by converting the simplified entries into traditional ones (that is, instead of the single entry "the government is incompetent" there will be two, 政府无能 and 政府無能), or convert the input from one form to the other on the fly and compare it with the database. Ready-made conversion projects can be found by searching for 简繁转换工具代码.
Secondly, a resident of the People's Republic of China uses pinyin in 95% of cases when typing on a computer or phone. Pinyin is a phonetic transcription of hieroglyphs: you type the transcription and pick the hieroglyph you need from the suggestions, much like T9. That is, to enter the previous phrase you type its pinyin transcription and choose the first suggested option. When creating a block list, this causes certain difficulties. After all, if you replace 无 with wu, the meaning of the phrase (政府无能) remains completely clear, much like writing Russian in look-alike Latin characters ("kak ecJIu 6bI"). So you should either convert hieroglyphs to their transcription on the fly and compare the result with the database, or inflate the database several more times (adding entries such as 政府wu能, 政府无neng, and so on). And this is before considering that syllables can be disguised with similar accented letters (these are actually tone marks, but that is beside the point: wǔ, wù, wú, wū).
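The "convert on the fly" option for both problems can be sketched as one normalization step: reduce every input and every block-list entry to the same toneless pinyin key and compare the keys. The two lookup tables below are tiny made-up excerpts; a real project would fill them from a 简繁 conversion table and a pinyin dictionary.

```python
import unicodedata

# Tiny illustrative tables (assumptions); real ones cover thousands of characters.
TRAD_TO_SIMP = {"無": "无"}
HANZI_TO_PINYIN = {"政": "zheng", "府": "fu", "无": "wu", "能": "neng"}


def strip_tone_marks(text):
    """Turn wù, wú, wū into plain wu by dropping combining accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    """Map text to a toneless, lowercase pinyin key with no whitespace."""
    text = strip_tone_marks("".join(text.split()).lower())
    simplified = (TRAD_TO_SIMP.get(ch, ch) for ch in text)
    return "".join(HANZI_TO_PINYIN.get(ch, ch) for ch in simplified)


# Block-list entries are stored as normalized keys, not as raw phrases.
BLOCKED_KEYS = {normalize("政府无能")}


def is_blocked(text):
    """Check whether any blocked key occurs inside the normalized text."""
    key = normalize(text)
    return any(blocked in key for blocked in BLOCKED_KEYS)


for sample in ["政府無能", "政府wu能", "zheng府wú neng", "你好"]:
    print(sample, is_blocked(sample))
```

The trade-off is that reducing everything to pinyin makes homophones match as well, so hits from a key like this are probably better sent to a moderator than blocked automatically.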
Thirdly, local tricks must be taken into account. Everyone has a rough idea of how to get around a block list in Russian. In Chinese it is both simpler and more complex. For example, 政府 (government) can be written as 正夂广付: a native speaker will easily understand the meaning, but the system will not block it. There are solutions for this too, based on heuristics and on comparing the elements of hieroglyphs. First you find a 字根 base (a database of hieroglyph components), and then painstakingly record in it the relations between the components and their possible combinations. Again, do not forget that all this must be done for both traditional and simplified forms.
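One simple way to use such a component table is to expand every block-list entry into its split spellings ahead of time. The sketch below assumes a hand-written `COMPONENTS` excerpt (hypothetical, covering only the example above); it is not a heuristic matcher, just the brute-force expansion.

```python
from itertools import product

# Hypothetical excerpt of a component table; a real one is built from a 字根 base.
COMPONENTS = {
    "政": ["正夂", "正攵"],
    "府": ["广付"],
}


def expand_variants(phrase):
    """Yield the phrase plus every spelling with characters split into parts."""
    options = [[ch] + COMPONENTS.get(ch, []) for ch in phrase]
    for combo in product(*options):
        yield "".join(combo)


def build_blocklist(phrases):
    """Expand each phrase into all of its split-character variants."""
    return {variant for phrase in phrases for variant in expand_variants(phrase)}


def is_blocked(text, blocklist):
    """Substring check after removing whitespace between the parts."""
    compact = "".join(text.split())
    return any(entry in compact for entry in blocklist)


blocklist = build_blocklist(["政府"])
print(sorted(blocklist))
print(is_blocked("正 夂 广 付 无能", blocklist))
```

The number of variants grows multiplicatively with phrase length, which is one reason the database inflates so quickly once traditional forms are added as well.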
Fourthly, the database must be updated constantly, because the local bright minds are not asleep either and keep looking for new ways around the controls.
Point two - blocking graphic content.
Here everything is a little more complicated, and you cannot do without a human moderator as the final step, but you can (and should) simplify the moderator's work.
Firstly, pictures may also contain text. To handle this, connect any OCR module and link it to the text database from the previous point.
Secondly, there are several dozen styles of writing hieroglyphs. The task is made easier by the fact that any hieroglyph written in a specific style is easy to identify. To do this, go to any online converter site, parse its database, and map each rendered glyph to its hieroglyph. For example, the phrase about the government above looks like this in the 简黄草 style:

And in the 中国龙新草体 style, like this:

Accordingly, for each hieroglyph in the database you essentially need to create its own picture and map it to the corresponding entry in the block list.
Thirdly, you need to use existing methods for blocking clearly pornographic pictures, read the news weekly and add news photos to the block table for graphic content (there is a huge number of image-matching solutions available around the world), and so on. This will not relieve the moderators of work completely, but it will make it considerably easier.
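For the table of already known images, one common matching technique is a perceptual "average hash"; the article does not say which method was actually used, so this is only an illustration. The sketch assumes each picture has already been reduced to a small grayscale grid by some image library, and the function names and the 4x4 sample grids are made up.

```python
def average_hash(pixels):
    """Compute a bit-string hash from a small grayscale grid (rows of 0-255 ints)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # Each bit records whether a pixel is brighter than the grid average.
    return "".join("1" if value > mean else "0" for value in flat)


def hamming(a, b):
    """Count the positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def matches_known(pixels, known_hashes, max_distance=2):
    """True if the grid is within max_distance bits of any known hash."""
    h = average_hash(pixels)
    return any(len(h) == len(k) and hamming(h, k) <= max_distance
               for k in known_hashes)


# Hypothetical 4x4 grids standing in for downscaled pictures.
banned = [[10, 10, 200, 200], [10, 10, 200, 200],
          [200, 200, 10, 10], [200, 200, 10, 10]]
brighter_copy = [[value + 15 for value in row] for row in banned]
unrelated = [[0] * 4, [0] * 4, [255] * 4, [255] * 4]

known = {average_hash(banned)}
print(matches_known(brighter_copy, known), matches_known(unrelated, known))
```

A hash like this tolerates re-compression and small brightness changes but not cropping or rotation, so it narrows the moderator's queue rather than replacing it.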
As for automatic censoring of video, it is so cumbersome, and building it yourself is so thankless, that it is generally easier to disable the feature altogether. Or else arrange manual pre-moderation of all videos.
In fact, implementing the first two points will give you about 90% protection against a loud exit from the Chinese market and a pass into all respectable local app stores.
These are only the basic actions, and the article covers them very superficially. It does not consider filtering content sent from the application to third-party resources (social networks, microblogs, etc.), filtering files for prohibited content, or hundreds of other items. Here is a more or less simplified diagram of how the Goonie solution works. After implementing this ourselves, I now understand why they charge money for it :-)

Thanks for your attention. I hope it was interesting.
PS An example of a basic text block list of 3,500 lines (in case anyone is interested) is available via the link.

Source: https://habr.com/ru/post/245027/

