
It turns out that modern C++ compilers support Unicode identifiers.

To someone who learned to program from Jermain and Stabley in the early 80s, this looks crazy. To those who studied C++ from Stroustrup's early editions, probably, too. That was a long time ago. Maybe to the next generation, what I am about to describe will not seem wild at all. Still, the topic is not entirely idle. If you are willing to listen to an old man's grumbling, read on.

The language standard still says: the characters a-z, A-Z and "_" may appear anywhere in an identifier; the digits 0-9 anywhere except the first position. No other characters are mentioned there. Microsoft's Visual Studio additionally counts the dollar sign "$" as a letter, and stipulates that the maximum identifier length the compiler still distinguishes is 2048 characters. Longer names are allowed, but the extra characters are ignored. For details, see the corresponding MSDN page.

Here is the problem that prompted me to write all this. Stripped of everything unnecessary, it looks like this:

 void some(){ int c(0); ++с; } //error C2065: 'с' : undeclared identifier 

It turns out that the second "c" is the Russian letter es (с). It is not surprising that it ended up there: on the keyboard these twin letters share the same key. I simply failed to switch the layout in time; it happens.
What was surprising is that the compiler treated this character as a valid, albeit undeclared, C++ identifier. An error like "invalid token" would have been more expected. To understand what this means, let us torture the compiler a little more. To rule out an optical illusion, instead of "c" we use a more recognizable Russian letter. For example:

 void some(){ int ы(0); ++ы; } //OK 

Works.

We go further, adding more and more characters. Experiment shows that almost all of Unicode falls into the "allowed" set: anything the language standard has not reserved for something else. Characters from every language I could find in Google Translate worked. To run the experiment properly, you need to save the source in UTF-8 format, not forgetting the BOM at the beginning of the file. For Russian letters even that is unnecessary.

For example, this program compiles and runs without errors:

 #include <stdio.h>
 #include <math.h>
 #define 前 for
 #define ζ•΄ζ•° int
 #define ダブル double
 #define θ™šγ—γ„ void
 #define εˆ·γ‚‹γƒ• printf
 #define ァむン sin
 #define フフラッシγƒ₯ fflush

 θ™šγ—γ„ γγ‚Œγ‚’γ‚„γ‚‹()
 {
     前(ζ•΄ζ•° 私 = 0; 私 < 100; ++私)
     {
         ダブル x = 2 * 3.1415926 * ダブル(私) / 100;
         εˆ·γ‚‹γƒ•("\n%g;%g", x, ァむン(x));
     }
     フフラッシγƒ₯(stdout);
 }

 ζ•΄ζ•° _tmain(ζ•΄ζ•° argc, _TCHAR* argv[])
 {
     γγ‚Œγ‚’γ‚„γ‚‹();
     return 0;
 }

Apparently, the decision about which characters are valid is made almost by accident. On Windows, the answer to the question "which Unicode characters count as letters" is given by a function from the Windows API:

 BOOL IsCharAlpha(TCHAR ch); 

The author of this function reasonably counted as letters everything that speakers of the respective languages assemble into words. And compiler developers, it seems, simply use such functions without further ado.

Let us write a simple test:

 #include "stdafx.h"
 #include <windows.h>
 #include <stdio.h>
 #include <conio.h>

 int _tmain(int argc, _TCHAR* argv[])
 {
     TCHAR ю = 'ю'; // 8-bit "ю", false
     BOOL is_letter = IsCharAlpha(ю);
     printf("letter = %d\n", int(is_letter));
     getch();
     return 0;
 }

The eight-bit Cyrillic character "ю" (code 0xFE in the Windows-1251 code page) is sign-extended when converted to TCHAR, becoming 0xFFFE, which lies in a reserved Unicode area that contains no letters. We get the expected false. Among other characters, a negative result is produced by punctuation marks, box-drawing characters and, for some reason, Braille characters. Everything else is considered a letter. Here is a short list of tested codes:

 // TCHAR ю = 0x044E; // UTF-16 "ю", true
 // TCHAR ю = 0x00E1; // UTF-16 latin "small a with acute" letter, true
 // TCHAR ю = 0x0633; // UTF-16 arabic "sin" letter, true
 // TCHAR ю = 0x09A2; // UTF-16 bengali "ddha" letter, true
 // TCHAR ю = 0x0060; // UTF-16 "grave accent", false
 // TCHAR ю = 0x00BD; // UTF-16 "vulgar fraction one half", false
 // TCHAR ю = 0x27F5; // UTF-16 "long leftwards arrow", false

Someone, perhaps, will rejoice at the new possibilities. I would not. I strongly doubt that such an extended set of variable names is ever needed in a programmer's actual work. And many have had the chance to see that it is a source of errors. It is especially unpleasant that diagnosing such errors may require examining the text in a hex editor (in the case of twin letters). MSVS, of course, is no exception.

In any case, one needs to know about this feature of current compilers. And when they show us "strange errors", it is worth retyping the problematic identifier in the correct layout.

But could it be useful for anything? I thought about this question for a long time and finally came up with something. Here it is. In the old days a peculiar Russified version of the Algol language was used to teach students. Such a language can now easily be emulated with the C++ macro preprocessor. Maybe someone would like to indulge in nostalgia?

Like this:

 #define нач {
 #define если if(
 #define то )
 #define иначе else
 #define кон }

The end.

Source: https://habr.com/ru/post/238627/

