Unicode Normalization

I once watched spammers slip past a spam filter in a rather interesting way. Instead of the usual example.com, the link looked like this:
http://example．com
A link with this unusual dot works in IE7, FF3, Opera 9.5, Safari 3 and Google Chrome, but does not work in IE6.

UAX #15: Unicode Normalization Forms

After some thought, I started looking for a solution. Since the dot clearly belonged to the more esoteric Unicode characters (as I learned later, it is the fullwidth full stop, U+FF0E, used in Japanese text), I decided to look into the relevant standard, and there I found the answer. It turns out that there are procedures for normalizing text, after which it is suitable for comparison.

Composition, decomposition, and transformation of exotic characters

Unicode defines four normalization forms. The first two, composition and decomposition, deal with the following problems:

In Unicode, the same accented letter, such as "Ç", can be represented in two ways: as a single precomposed character, or as a base letter ("C") followed by modifiers (combining marks). Merging base letters and modifiers into single characters wherever possible is called composition (Normalization Form C, hereafter NFC); breaking every letter down into a base letter and its modifiers is called decomposition (Normalization Form D, hereafter NFD).
If a letter has several modifiers, they can appear in different orders.
The same letter can have several variants (for example, "Ω", the Greek capital omega U+03A9, and "Ω", the ohm sign U+2126).

To clarify all of the above, here are some illustrations from the standard:
[Figures: NFC and NFD examples from the standard]
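The three problems above can also be checked in a few lines. This is a minimal sketch in Python, using the standard unicodedata module rather than any of the libraries discussed in this article; the helper name equivalent is made up for the illustration.

```python
import unicodedata

def equivalent(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# 1. One letter, two representations: precomposed "Ç" vs "C" + COMBINING CEDILLA.
composed = "\u00C7"
decomposed = "C\u0327"
print(composed == decomposed)             # raw comparison of code points
print(equivalent(composed, decomposed))   # comparison after NFC

# 2. Several modifiers in different orders: dot above (U+0307), dot below (U+0323).
order_a = "q\u0307\u0323"
order_b = "q\u0323\u0307"
print(equivalent(order_a, order_b, "NFD"))

# 3. Variants of the same letter: OHM SIGN (U+2126) vs GREEK CAPITAL OMEGA (U+03A9).
print(equivalent("\u2126", "\u03A9"))
```

Note that NFC and NFD alone do not touch the fullwidth dot from the spam link; it stays different from an ordinary period under both forms.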

Moving on. There are many characters, such as the "．" dot above, that look very similar to others and can be abused by spammers. For exactly such cases there are Normalization Form KC (NFKC) and Normalization Form KD (NFKD), which, in addition to (de)composition, normalize the following kinds of characters:

Font variants (ℍ and ℌ)
Circled characters (①)
Changed width and rotation (ｶ and カ, ︷ and {)
Superscripts and subscripts (⁹ and ₉)
Fractions (¼)
Other (™)

Let's see it in action:
[Figures: NFKC and NFKD examples from the standard]
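The same character classes can be run through NFKC directly. The sketch below again uses Python's standard unicodedata module, not the libraries from this article; SAMPLES and fold are names invented for the example, and the expected values are the compatibility mappings of these characters.

```python
import unicodedata

# Each key is a character from the list above; each value is its NFKC result.
SAMPLES = {
    "\u210D": "H",          # double-struck capital H
    "\u210C": "H",          # black-letter capital H
    "\u2460": "1",          # circled digit one
    "\uFF76": "\u30AB",     # halfwidth katakana KA -> regular katakana KA
    "\uFE37": "{",          # vertical form of the left curly bracket
    "\u2079": "9",          # superscript nine
    "\u2089": "9",          # subscript nine
    "\u00BC": "1\u20444",   # one quarter -> "1", FRACTION SLASH, "4"
    "\u2122": "TM",         # trade mark sign
    "\uFF0E": ".",          # fullwidth full stop from the spam link
}

def fold(text):
    """Hypothetical helper: apply compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", text)

for original in SAMPLES:
    # Show the code point, the NFC result (unchanged) and the NFKC result.
    nfc = unicodedata.normalize("NFC", original)
    print(f"U+{ord(original):04X}  NFC={nfc!r}  NFKC={fold(original)!r}")
```

NFC leaves every one of these characters as it is, which is why the plain composition forms are not enough for filtering.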

Thus, NFKC/NFKD is exactly what we need to protect against spammers and other evil spirits. All that remains is to hook it into the program.

Implementation

For C/C++ there is the ICU library; I think most people who have had to work with Unicode in C/C++ know about it. For those who do not, see the official ICU site. In ICU, all normalization is done through the Normalizer class.
For Java, there is the same ICU and the same Normalizer class.
For PHP, things are more complicated. I know of at least two ways:
- Use the Normalizer class from the intl library.
- If for some reason you cannot use the intl library, you can take a ready-made implementation from MediaWiki (via SVN), where it is implemented as an independent subsystem.

Here is a simple example (given my main language and my main project, I will use the last library mentioned):

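<?php
// UtfNormal is the standalone normalization library taken from MediaWiki.
require_once 'normal/UtfNormal.php';

// The dot before "com" is U+FF0E (fullwidth full stop), not an ASCII period.
$input = "http://example．com";
echo "{$input}\n";                      // the string as received
echo UtfNormal::toNFKC($input) . "\n";  // the same string after NFKC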

This program displays the following:

 http://example．com
 http://example.com
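To show where this fits in a filter, here is a rough sketch of a blocklist check that normalizes a link before comparing it. It is written in Python with the standard library only, not with the PHP library above; BLOCKED_HOSTS, normalized_host and is_blocked are hypothetical names, and a real filter would need more than this.

```python
import unicodedata
from urllib.parse import urlsplit

# Hypothetical blocklist of host names, stored in plain ASCII lower case.
BLOCKED_HOSTS = {"example.com"}

def normalized_host(url):
    """Apply NFKC to the whole link, then extract the lower-cased host."""
    folded = unicodedata.normalize("NFKC", url)
    return (urlsplit(folded).hostname or "").lower()

def is_blocked(url):
    """Check the normalized host against the blocklist."""
    return normalized_host(url) in BLOCKED_HOSTS

print(is_blocked("http://example\uFF0Ecom"))  # link with the fullwidth dot
print(is_blocked("http://example.org"))       # unrelated host
```

Normalizing before parsing means the host is compared in the same form in which the browser would resolve it, so the fullwidth dot no longer hides the domain.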

Conclusion

As we can see, NFKC/NFKD narrows the room for "games with letters" and is indispensable in spam filters and blockers. In addition, NFC makes text more compact.

Source: https://habr.com/ru/post/45489/
