National domain names: from ASCII format to IDN and back

If you need to work with national domain names, the “xn--abracadabra.com” form sent by the client is sufficient in most cases. Sometimes, however, you need the name in its national representation, for example “пример.com”.

This article shows how to convert national domain names programmatically from their ASCII form to IDN and back, using Microsoft Visual Studio and the ICU library.

Background.
If you are already familiar with the abbreviation IDN, you can safely skip the next four paragraphs.
Historically, domain names on the Internet were written using ASCII characters only: “a-z”, “0-9” and “-”. As the Internet grew, these characters (or rather, the short and convenient names made from them) started to run out, and ICANN announced the need to extend the set of characters allowed in domain names with national alphabets (represented in Unicode).

IDN (Internationalized Domain Names) are domain names that contain characters from national alphabets. For example, “сайт.com”.

The many discussions on the few IDN forums boil down to two opinions: “awesome, I'll take two!” and “they are trying to fool us, to put it mildly.” The second opinion stems from the way the technology is implemented.

New characters are just well-encoded old ones :)

In essence, an IDN is a convenient and pretty wrapper around a long and awkward string of characters. On the client side, national characters are encoded into valid ASCII characters, and the result is the actual domain name. If you type “пример.испытание” into the address bar, it is converted to “xn--e1afmkfd.xn--80akhbyknj4f”. This is done with Punycode, the ASCII-Compatible Encoding (ACE) currently used in the multilingual domain name system. The Punycode algorithm is quite simple and is described in detail in RFC 3492, which also includes a sample implementation in C.
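To give a feel for what happens to a single label, here is a minimal C++ sketch of the encoding procedure described in RFC 3492. The names punycode_encode, encode_digit and adapt are my own and do not belong to any library. The sketch handles one label at a time, does not add the “xn--” prefix, and skips the normalization step (Nameprep) that full IDNA processing performs first, so treat it as an illustration rather than a replacement for the tools below.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

namespace {
// Bootstring parameters for Punycode, as listed in RFC 3492.
constexpr std::uint32_t kBase = 36, kTMin = 1, kTMax = 26, kSkew = 38,
                        kDamp = 700, kInitialBias = 72, kInitialN = 128;

// Maps a digit value 0..35 to 'a'..'z', '0'..'9'.
char encode_digit(std::uint32_t d) {
    return static_cast<char>(d < 26 ? 'a' + d : '0' + (d - 26));
}

// Bias adaptation after each encoded delta.
std::uint32_t adapt(std::uint32_t delta, std::uint32_t num_points, bool first_time) {
    assert(num_points > 0);
    delta = first_time ? delta / kDamp : delta / 2;
    delta += delta / num_points;
    std::uint32_t k = 0;
    while (delta > ((kBase - kTMin) * kTMax) / 2) {
        delta /= kBase - kTMin;
        k += kBase;
    }
    return k + (kBase - kTMin + 1) * delta / (delta + kSkew);
}
}  // namespace

// Encodes one label given as Unicode code points (hypothetical helper).
std::string punycode_encode(const std::u32string& input) {
    const std::uint32_t kMax = std::numeric_limits<std::uint32_t>::max();
    std::string output;

    // Copy the basic (ASCII) code points first, in their original order.
    for (char32_t c : input)
        if (c < 0x80) output.push_back(static_cast<char>(c));
    const std::uint32_t basic = static_cast<std::uint32_t>(output.size());
    std::uint32_t handled = basic;
    if (basic > 0) output.push_back('-');  // delimiter after the basic part

    std::uint32_t n = kInitialN, delta = 0, bias = kInitialBias;
    while (handled < input.size()) {
        // Find the smallest code point not yet handled.
        std::uint32_t m = kMax;
        for (char32_t c : input)
            if (c >= n && c < m) m = c;
        if (m - n > (kMax - delta) / (handled + 1))
            throw std::overflow_error("punycode overflow");
        delta += (m - n) * (handled + 1);
        n = m;

        for (char32_t c : input) {
            if (c < n && ++delta == 0)
                throw std::overflow_error("punycode overflow");
            if (c == n) {
                // Write delta as a variable-length integer in base 36.
                std::uint32_t q = delta;
                for (std::uint32_t k = kBase;; k += kBase) {
                    const std::uint32_t t =
                        k <= bias ? kTMin : (k >= bias + kTMax ? kTMax : k - bias);
                    if (q < t) break;
                    output.push_back(encode_digit(t + (q - t) % (kBase - t)));
                    q = (q - t) / (kBase - t);
                }
                output.push_back(encode_digit(q));
                bias = adapt(delta, handled + 1, handled == basic);
                delta = 0;
                ++handled;
            }
        }
        ++delta;
        ++n;
    }
    return output;
}
```

Passing the code points of “пример” to punycode_encode yields “e1afmkfd”, which is the part that follows the “xn--” prefix in the first label of the encoded name above.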

What encoding and transcoding tools are at our disposal?

1. Microsoft tools.

In the .NET Framework, available from Visual Studio, the System.Globalization namespace contains the IdnMapping class. Among its methods are GetAscii and GetUnicode, which perform the conversion in accordance with the IDNA standards. Not a class but a dream; it could hardly be simpler:

    // C++/CLI, compiled with /clr
    using namespace System;
    using namespace System::Globalization;

    int main()
    {
        IdnMapping idn;

        // Unicode -> ASCII (Punycode)
        String^ s1 = L"привет.пример";
        String^ s = idn.GetAscii(s1, 0, s1->Length);
        Console::WriteLine(s);

        // ASCII (Punycode) -> Unicode
        String^ s2 = L"xn--b1agh1afp.xn--e1afmkfd";
        s = idn.GetUnicode(s2, 0, s2->Length);
        Console::WriteLine(s);
        return 0;
    }

Result:

xn--b1agh1afp.xn--e1afmkfd
привет.пример

For the same purpose, Microsoft also provides two API functions, IdnToAscii and IdnToUnicode. Unfortunately, the minimum supported client is Windows Vista, which is a pity. An example of how to use them can be found on the Microsoft website.

2. ICU (International Components for Unicode) tools.

ICU is a set of open source C/C++ and Java libraries that provide Unicode and globalization support. The library implements the following domain name conversion functions:

int32_t uidna_toUnicode / uidna_toASCII (const UChar *src, int32_t srcLength, UChar *dest, int32_t destCapacity, int32_t options, UParseError *parseError, UErrorCode *status)

- used for ASCII-to-Unicode / Unicode-to-ASCII conversion of individual labels (the components of a domain name). For example, “www.example.com” consists of three labels: “www”, “example” and “com”.

int32_t uidna_IDNToUnicode / uidna_IDNToASCII (const UChar *src, int32_t srcLength, UChar *dest, int32_t destCapacity, int32_t options, UParseError *parseError, UErrorCode *status)

- used for ASCII-to-Unicode / Unicode-to-ASCII conversion of complete domain names. For example, “www.example.com”.
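The difference between the two pairs of functions comes down to labels: a complete name is split on dots, and each label is converted on its own. The following C++ sketch only illustrates that splitting step and shows how to tell which labels need conversion. All function names here (split_labels, join_labels, needs_ace, has_ace_prefix) are hypothetical helpers of mine, not ICU functions, and the sketch treats only the ASCII full stop as a separator, whereas IDNA also recognizes a few other dot characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits a complete domain name into its labels on '.'.
std::vector<std::u32string> split_labels(const std::u32string& domain) {
    std::vector<std::u32string> labels;
    std::u32string current;
    for (char32_t c : domain) {
        if (c == U'.') {
            labels.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    labels.push_back(current);  // the last label has no trailing dot
    return labels;
}

// Joins labels back into a complete name.
std::u32string join_labels(const std::vector<std::u32string>& labels) {
    assert(!labels.empty());
    std::u32string domain = labels[0];
    for (std::size_t i = 1; i < labels.size(); ++i) {
        domain.push_back(U'.');
        domain += labels[i];
    }
    return domain;
}

// A label needs the Unicode -> ASCII conversion if it has non-ASCII characters.
bool needs_ace(const std::u32string& label) {
    for (char32_t c : label)
        if (c >= 0x80) return true;
    return false;
}

// A label is a candidate for ASCII -> Unicode conversion if it starts
// with the ACE prefix "xn--" (compared case-insensitively).
bool has_ace_prefix(const std::u32string& label) {
    static const std::u32string prefix = U"xn--";
    if (label.size() < prefix.size()) return false;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        char32_t c = label[i];
        if (c >= U'A' && c <= U'Z') c += 32;  // lower-case ASCII letters
        if (c != prefix[i]) return false;
    }
    return true;
}
```

With these helpers, a label-level converter would be applied only to the labels for which needs_ace or has_ace_prefix returns true, and the results would then be joined back with join_labels.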

Parameters:

src - pointer to the input string to be converted.
srcLength - the length of src. If src is a null-terminated string, you can pass -1.
dest - pointer to the buffer where the converted string will be written.
destCapacity - the capacity of dest.
options - a bit set of options. It can combine the following values:
UIDNA_DEFAULT - the default: unassigned code points are not allowed and STD3 rules are not applied. If an unassigned code point is found, the call fails with U_UNASSIGNED_ERROR.
UIDNA_ALLOW_UNASSIGNED - if this flag is set, unassigned code points in the input string are treated as ordinary Unicode code points.
UIDNA_USE_STD3_RULES - the domain name syntax must comply with the STD3 ASCII rules. If it does not, the call fails with U_IDNA_STD3_ASCII_RULES_ERROR.

parseError - pointer to a UParseError structure. Can be NULL.
status - pointer to the error code. It must be set to U_ZERO_ERROR before the call.

The return value is the length of the converted string. Compare it with destCapacity: if it is greater, the result did not fit into dest.

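    #include <stdio.h>
    #include <windows.h>            /* MAX_PATH */
    #include "unicode/utypes.h"
    #include "unicode/parseerr.h"
    #include "unicode/uidna.h"

    /* Wide strings are passed directly: this relies on UChar being a
       16-bit wchar_t, as in the Windows build of ICU used here. */

    /* Unicode -> ASCII */
    const wchar_t* s1 = L"привет.пример";
    wchar_t pPunycode[MAX_PATH];
    UErrorCode status = U_ZERO_ERROR;
    int32_t len = uidna_IDNToASCII(s1, -1, pPunycode, MAX_PATH, UIDNA_USE_STD3_RULES, NULL, &status);
    if (U_FAILURE(status))          /* covers U_IDNA_STD3_ASCII_RULES_ERROR and any other error */
        wprintf(L"Error\n");

    /* ASCII -> Unicode */
    const wchar_t* s2 = L"xn--e1afmkfd.xn--e1afnjf";
    wchar_t pUnicode[MAX_PATH];
    status = U_ZERO_ERROR;          /* reset before the next call */
    len = uidna_IDNToUnicode(s2, -1, pUnicode, MAX_PATH, UIDNA_ALLOW_UNASSIGNED, NULL, &status);
    if (U_FAILURE(status))
        wprintf(L"Error\n");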

The results are similar to the previous example.

Before you can use the library, you need to build it. Step by step (for Microsoft Visual Studio):

1. Choose the latest release on the ICU download page (mine is ICU4C 4.4, 2010-03-17).
2. Download the sources.
3. Add the ICU binaries directory to the PATH environment variable: “\bin\”
4. Open the solution: “\source\allinone\allinone.sln”
5. Build -> Batch Build... -> Select All -> Rebuild.
6. Build -> Rebuild Solution.

If the build fails, open “\Readme.html” -> “How To Build And Install ICU” and check your setup. If it builds without errors, go ahead and use it.

P.S. I would welcome any comments and corrections.
P.P.S. I would also welcome interesting additions on the topic.

Source: https://habr.com/ru/post/89247/
