What are TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR (etc.)?

Many C++ programmers writing for Windows are confused by strange identifiers such as TCHAR and LPCTSTR. In this article I will try to dot all the i's and dispel the fog of doubt.

At one time I spent many hours digging through source code without understanding what these mysterious TCHAR, WCHAR, LPSTR, LPWSTR and LPCTSTR mean.
Recently I found a very competent article, and I present a careful translation of it here.
The article is recommended for those who spend sleepless nights crawling through C++ code.
Interested? Then read on.

In general, a character can be represented with 1 byte or with 2 bytes.
Typically, a single-byte character uses the ANSI encoding, in which all English characters can be represented. A 2-byte character uses Unicode, in which the characters of all the other languages of the world can be represented as well.

The Visual C++ compiler supports char and wchar_t as built-in data types for ANSI and Unicode characters respectively. Unicode has a more precise definition than this, but for the purposes of this article it is enough to know that Windows uses a 2-byte encoding to support applications in many languages.

To represent 2-byte Unicode text, Microsoft Windows uses the UTF-16 encoding.
Microsoft was one of the first companies to implement Unicode support in its operating systems (the Windows NT family).

What should you do if you want your C/C++ code to be independent of the character encoding and of the build mode in use?

TIP: use generic data types and names to represent characters and strings.

For example, instead of changing the following code:

 char cResponse; // 'Y' or 'N'
 char sUsername[64];
 // str* functions

to this:

 wchar_t cResponse; // 'Y' or 'N'
 wchar_t sUsername[64];
 // wcs* functions

in order to support multilingual applications (that is, Unicode), you can write the code in a more generic manner:

 #include <TCHAR.H> // Implicit or explicit include
 TCHAR cResponse; // 'Y' or 'N'
 TCHAR sUsername[64];
 // _tcs* functions

The project settings have a Character Set option on the General tab that determines which encoding the program is compiled for.

If “Use Unicode Character Set” is selected, the TCHAR type resolves to wchar_t. If “Use Multi-Byte Character Set” is selected, TCHAR resolves to char. You can still use the char and wchar_t types freely; the project settings do not affect these keywords in any way.

TCHAR is defined as:

 #ifdef _UNICODE
 typedef wchar_t TCHAR;
 #else
 typedef char TCHAR;
 #endif

The _UNICODE macro is defined when you select “Use Unicode Character Set”, and TCHAR is then defined as wchar_t. When you select “Use Multi-Byte Character Set”, TCHAR is defined as char.
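
The same switch can be tried outside the Windows headers. This is only a minimal sketch: `DEMO_UNICODE`, `demo_tchar` and `demo_is_wide` are hypothetical stand-ins for `_UNICODE` and `TCHAR`, chosen so that nothing clashes with the real definitions.

```cpp
#include <type_traits>

// Hypothetical stand-in for _UNICODE; remove this line to get the "ANSI" build.
#define DEMO_UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;   // plays the role of TCHAR in a Unicode build
#else
typedef char demo_tchar;      // plays the role of TCHAR in an ANSI build
#endif

// True when the alias resolved to the wide character type.
constexpr bool demo_is_wide = std::is_same<demo_tchar, wchar_t>::value;

// Whatever the build mode, the alias is exactly one of the two built-in types.
static_assert(std::is_same<demo_tchar, char>::value || demo_is_wide,
              "demo_tchar must be char or wchar_t");
```

With `DEMO_UNICODE` defined, `demo_tchar` is `wchar_t`; without it, the very same source uses `char`.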

In addition, to support multiple character sets from a single code base, and possibly multilingual applications, use the generic functions (which are in fact macros).
Instead of strcpy, strlen, strcat (including their secure versions with the _s suffix) or wcscpy, wcslen, wcscat (including the secure versions), prefer _tcscpy, _tcslen and _tcscat.

As you know, the strlen function is described as follows:

 size_t strlen(const char*);

And the wcslen function is described as follows:

 size_t wcslen(const wchar_t* );

It is better to use _tcslen, whose logical declaration is:

 size_t _tcslen(const TCHAR* );

WC stands for Wide Character, so the wcs functions operate on wide-character strings. Likewise, _tcs means a TCHAR string, whose characters are logically either char or wchar_t.

But in reality _tcslen (and the other functions with the _tcs prefix) are not functions at all; they are macros. They are simply defined as:

 #ifdef _UNICODE
 #define _tcslen wcslen
 #else
 #define _tcslen strlen
 #endif

You can open the TCHAR.H header file and find more macro definitions like the one above.
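
To see the mapping at work without TCHAR.H, here is a small sketch. The names `DEMO_UNICODE`, `demo_tchar`, `demo_tcslen` and `length_of` are hypothetical and only imitate `_UNICODE`, `TCHAR` and `_tcslen`.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

#define DEMO_UNICODE  // hypothetical stand-in for _UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define demo_tcslen std::wcslen   // Unicode build: the macro forwards to wcslen
#else
typedef char demo_tchar;
#define demo_tcslen std::strlen   // ANSI build: the macro forwards to strlen
#endif

// Generic code calls the macro and never names strlen or wcslen directly.
inline std::size_t length_of(const demo_tchar* text) {
    return demo_tcslen(text);
}
```

The body of `length_of` does not change between the two build modes; only the macro expansion does.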

Thus TCHAR is not a separate type at all, but a layer over the char and wchar_t types that lets us choose whether the application is built as multilingual or as single-language.

You may ask why these are macros rather than real functions.
The reason is simple: a library or DLL can export only one function with a given name and prototype (leaving C++ overloading aside).
For example, suppose you export this function:

 void _TPrintChar(char);

How is the client supposed to call it? Like this?

 void _TPrintChar(wchar_t);

_TPrintChar cannot magically turn into a function that takes a two-byte character as its argument.

Instead, we write two different functions:

 void PrintCharA(char); // A = ANSI
 void PrintCharW(wchar_t); // W = Wide character

And a simple macro hides the difference between them:

 #ifdef _UNICODE
 #define _TPrintChar PrintCharW
 #else
 #define _TPrintChar PrintCharA
 #endif

The client simply calls the function as:

 TCHAR cChar;
 _TPrintChar(cChar);

Note that TCHAR and _TPrintChar now resolve to Unicode or ANSI together, so the variable cChar and the function parameter are both char or both wchar_t.

Macros let us get around these difficulties and use either the ANSI or the Unicode functions for our characters and strings. Many Windows functions are declared this way, so the programmer sees only one function (that is, one macro), and that is a good thing.
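
The A/W pattern can be sketched in a few lines. The functions below are hypothetical: instead of printing, they return their own name so that it is visible which one the macro selected. `DEMO_UNICODE`, `demo_tchar`, `DemoPrintChar` and `call_generic` are names invented for this sketch.

```cpp
#include <string>

#define DEMO_UNICODE  // hypothetical stand-in for _UNICODE

// Two real functions with distinct names, as a DLL would export them.
inline std::string PrintCharA(char)    { return "PrintCharA"; }  // A = ANSI
inline std::string PrintCharW(wchar_t) { return "PrintCharW"; }  // W = wide

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DemoPrintChar PrintCharW
#else
typedef char demo_tchar;
#define DemoPrintChar PrintCharA
#endif

// Client code: the character type and the function switch together.
inline std::string call_generic() {
    demo_tchar ch = 0;
    return DemoPrintChar(ch);
}
```

Because the type alias and the macro are controlled by the same symbol, the argument always matches the parameter type of the selected function.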

Here is an example with SetWindowText:

 // WinUser.H
 #ifdef UNICODE
 #define SetWindowText SetWindowTextW
 #else
 #define SetWindowText SetWindowTextA
 #endif // !UNICODE

Only a few functions lack such a macro and exist only with the W or A suffix. One example is ReadDirectoryChangesW, which has no ANSI equivalent.

As you know, we use double quotes to write strings. A string written this way is an ANSI string, with 1 byte used for each character. For example:

 "This is an ANSI string. Every character takes 1 byte."

The string above is not a Unicode string and is not suitable for multilingual support. To get a Unicode string, you need to use the L prefix.
For example:

 L"This is a Unicode string. Every character takes 2 bytes, including spaces."

Put an L in front and you get a Unicode string. All characters (I repeat, all characters) occupy 2 bytes, including English letters, spaces, digits and the null character. The size of a Unicode string is therefore always a multiple of 2 bytes: a Unicode string of 7 characters occupies 14 bytes. If a Unicode string appears to be 15 bytes long, it is not a valid string and will not work in any context.

More generally, the size of a string in bytes is a multiple of sizeof(TCHAR).
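
These sizes are easy to check with sizeof. One caveat for this sketch: wchar_t is 2 bytes with Visual C++ but is commonly 4 bytes with other compilers, so the 2-byte case is shown here with a `u"..."` literal (char16_t), which I assume to be 2 bytes wide as it is on mainstream platforms. The constant names are made up for the example.

```cpp
#include <cstddef>

// sizeof on a string literal counts every code unit, including the terminating null.
constexpr std::size_t narrow_bytes = sizeof("Saturn");   // 7 units of 1 byte
constexpr std::size_t utf16_bytes  = sizeof(u"Saturn");  // 7 units of 2 bytes
constexpr std::size_t wide_bytes   = sizeof(L"Saturn");  // 7 units of sizeof(wchar_t)

static_assert(narrow_bytes == 7, "7 one-byte characters including the null");
static_assert(utf16_bytes == 14, "7 two-byte code units including the null");
static_assert(wide_bytes == 7 * sizeof(wchar_t),
              "always a whole multiple of the character size");
```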

When you need to hard-code a string, you can write it like this:

 "ANSI string"; // ANSI
 L"Unicode string"; // Unicode
 _T("Either kind of string, depending on compilation"); // ANSI or Unicode
 // or use the TEXT macro, if you prefer

Strings without a prefix are ANSI strings, strings with the L prefix are Unicode strings, and strings wrapped in _T or TEXT depend on the compilation mode. Once again, _T and TEXT are macros. They are defined as:

 // Simplified
 #ifdef _UNICODE
 #define _T(c) L##c
 #define TEXT(c) L##c
 #else
 #define _T(c) c
 #define TEXT(c) c
 #endif

The ## symbol is the token-pasting operator, which turns _T("Unicode") into L"Unicode", where the string is the argument of the macro, provided of course that _UNICODE is defined.
If _UNICODE is not defined, _T("Unicode") simply becomes "Unicode". The token-pasting operator exists in C as well; it is not something specific to string encodings in VC++.

Note that the macros can be used not only for strings but also for characters. For example, _T('R') becomes either L'R' or just 'R', that is, either a Unicode character or an ANSI character.
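
Here is a sketch of the same token pasting with a home-made macro. `DEMO_T`, `DEMO_UNICODE`, `demo_tchar`, `demo_tstring`, `planet` and `initial` are hypothetical names; the real `_T` and `TEXT` come from the Windows headers.

```cpp
#include <string>

#define DEMO_UNICODE  // hypothetical stand-in for _UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(x) L##x   // pastes the L prefix onto the literal token
#else
typedef char demo_tchar;
#define DEMO_T(x) x      // leaves the literal untouched
#endif

typedef std::basic_string<demo_tchar> demo_tstring;

// Works for string literals: L"Saturn" or "Saturn".
inline demo_tstring planet() { return DEMO_T("Saturn"); }

// Works for character literals too: L'R' or 'R'.
inline demo_tchar initial() { return DEMO_T('R'); }
```

The macro only ever sees literal tokens here, which is the one situation where pasting an L in front produces something meaningful.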

No, and once again no: you cannot use these macros to convert a variable (a character or a string) into Unicode or non-Unicode text.
The following code is incorrect:

 char c = 'C';
 char str[16] = "Habrahabr";
 _T(c);
 _T(str);

The lines _T(c); and _T(str); compile perfectly well in ANSI mode, where _T(x) becomes x, so _T(c) and _T(str) become just c and str.
But when you build the project in Unicode mode, the code does not compile:

 error C2065: 'Lc' : undeclared identifier
 error C2065: 'Lstr' : undeclared identifier

I will not insult your intelligence with a long explanation: the preprocessor pastes L onto the token it is given, so the variable names turn into the unknown identifiers Lc and Lstr.

There are several functions for converting multibyte strings to UNICODE, which I will discuss shortly.
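
As a rough illustration of what such a conversion looks like, here is a sketch built on the standard `std::mbstowcs` rather than on the Windows API (where `MultiByteToWideChar` is the usual tool). The helper name `widen` is hypothetical, and the result depends on the current C locale, so treat this as a demonstration for plain ASCII text only.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Converts a multibyte string to a wide string using the current C locale.
// Returns an empty string if the input contains an invalid sequence.
inline std::wstring widen(const std::string& narrow) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buffer(narrow.size() + 1);
    std::size_t written = std::mbstowcs(buffer.data(), narrow.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return std::wstring();
    }
    return std::wstring(buffer.data(), written);
}
```

Unlike `_T(str)`, this works on variables, because the conversion happens at run time and not in the preprocessor.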

An important remark: almost all functions that take a string or a character, primarily in the Windows API, are documented under a generic name in MSDN and elsewhere.
The SetWindowTextA/W function is documented as:

 BOOL SetWindowText(HWND, const TCHAR*);

But as you know, SetWindowText is just a macro, and depending on the project settings it resolves to one of:

 BOOL SetWindowTextA(HWND, const char*);
 BOOL SetWindowTextW(HWND, const wchar_t*);

So do not be puzzled if you cannot get the address of this function:

 HMODULE hDLLHandle;
 FARPROC pFuncPtr;
 hDLLHandle = LoadLibrary(L"user32.dll");
 pFuncPtr = GetProcAddress(hDLLHandle, "SetWindowText");
 // pFuncPtr will be null, since no function named SetWindowText exists

User32.DLL exports two functions, SetWindowTextA and SetWindowTextW; there is no export with the generic name.

Functions that have both ANSI and Unicode versions generally have a real implementation only in the Unicode version. This means that when you call SetWindowTextA from your code and pass it an ANSI string, it converts the string from ANSI to Unicode and calls SetWindowTextW.
The real work (setting the window title, caption or label) is done only by the Unicode version!

Take another example, which retrieves the window text using GetWindowText.
You call GetWindowTextA and pass it an ANSI buffer as the target buffer.
GetWindowTextA first calls GetWindowTextW, possibly allocating memory for a Unicode string (that is, a wchar_t array).
It then converts the Unicode text to an ANSI string for you.
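
The forwarding can be pictured with a toy model. Everything here is hypothetical (`g_title`, `DemoSetTitleW`, `DemoSetTitleA`, `DemoGetTitleA`); it is not how User32 is written, and it assumes plain ASCII input so that each byte maps to exactly one wide character.

```cpp
#include <string>

// Toy "window" state: the title is stored only as a wide string.
static std::wstring g_title;

// The wide version does the real work.
inline bool DemoSetTitleW(const wchar_t* text) {
    g_title = text;
    return true;
}

// The narrow version only converts (ASCII assumed) and forwards.
inline bool DemoSetTitleA(const char* text) {
    std::wstring wide;
    for (const char* p = text; *p != '\0'; ++p) {
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*p)));
    }
    return DemoSetTitleW(wide.c_str());
}

// The narrow getter reads the wide title and converts it back,
// replacing anything outside ASCII with '?'.
inline std::string DemoGetTitleA() {
    std::string narrow;
    for (wchar_t ch : g_title) {
        bool ascii = static_cast<unsigned long>(ch) < 128;
        narrow.push_back(ascii ? static_cast<char>(ch) : '?');
    }
    return narrow;
}
```

Every call through the narrow functions pays for an extra conversion, and the round trip can lose characters that the narrow encoding cannot hold.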

These ANSI-to-Unicode conversions are not limited to GUI functions; they apply to the whole subset of Windows API functions that accept strings and come in two variants.
Here are a few more examples of such functions:

CreateProcess
GetUserName
OpenDesktop
DeleteFile
etc.

It is therefore highly recommended to call the Unicode functions directly.
In turn, this means that you should always target the Unicode build rather than the ANSI build, even if you have been used to ANSI strings for many years.

Yes, you can still store and receive ANSI strings, for example to write them to a file or to send a message in your chat program. Conversion functions exist for such needs.

Note: there is one more typedef, WCHAR, which is equivalent to wchar_t.

TCHAR declares a single character. You can also declare an array of TCHAR. But what if you want to declare a pointer to characters, or a pointer to constant characters?
For example:

 // ANSI characters
 void foo_ansi(char*);
 void foo_ansi(const char*);
 /*const*/ char* pString;

 // Unicode/wide-string
 void foo_uni(WCHAR*);
 void foo_uni(const WCHAR*);
 /*const*/ WCHAR* pString;

 // Independent of the character set
 void foo_char(TCHAR*);
 void foo_char(const TCHAR*);
 /*const*/ TCHAR* pString;

After reading about TCHAR, you will probably prefer to use it. But there are other good alternatives for representing strings in your code. To use them, simply include Windows.h in the project.
Note: if your project includes Windows.h (directly or indirectly), you do not need to include TCHAR.H to use these types.
To make things easier to understand, let us first revisit an old function, strlen:

 size_t strlen(const char*);

This can also be written as:

 size_t strlen(LPCSTR);

where LPCSTR is defined as:

 // Simplified
 typedef const char* LPCSTR;

LPCSTR reads as follows:
• LP - Long Pointer
• C - Constant
• STR - String
Essentially, LPCSTR is a (long) pointer to a constant string.

Let's rewrite strcpy to match the new style of type names:

 LPSTR strcpy(LPSTR szTarget, LPCSTR szSource);

szTarget is of type LPSTR, with no C language types spelled out. LPSTR is defined as:

 typedef char* LPSTR;

Note that szSource is of type LPCSTR: the strcpy function does not modify the source buffer, hence the const qualifier. The return type is a non-constant string, LPSTR.

So, the functions with the str prefix manipulate ANSI strings. But we also need to handle two-byte Unicode strings, and equivalent functions exist for these wide characters.
For example, to calculate the length of a wide-character (Unicode) string, you use wcslen:

 size_t nLength;
 nLength = wcslen(L"Unicode");

The prototype of the wcslen function is:

 size_t wcslen(const wchar_t* szString); // or WCHAR*

The prototype above can also be written as:

 size_t wcslen(LPCWSTR szString);

where LPCWSTR is defined as:

 typedef const WCHAR* LPCWSTR; // const wchar_t*

LPCWSTR reads as follows:
LP - Long Pointer
C - Constant
WSTR - Wide-character String

Similarly, wcscpy is the equivalent of strcpy for Unicode strings:

 wchar_t* wcscpy(wchar_t* szTarget, const wchar_t* szSource);

which can be written as:

 LPWSTR wcscpy(LPWSTR szTarget, LPCWSTR szSource);

where szTarget is a non-constant wide string (LPWSTR) and szSource is a constant wide string (LPCWSTR).

Every str function has an equivalent wcs function: the str functions are used for plain ANSI strings, and the wcs functions for Unicode strings.

I have already advised you to use the native Unicode functions rather than the ANSI-only functions or the synthesized TCHAR functions. The reason is simple: your application should be Unicode only, and you should not have to worry about whether it still builds for ANSI. But for the sake of completeness, I describe these generic mappings too.

To calculate the length of a string, you can use the _tcslen function (macro),
which is declared as follows:

 size_t _tcslen(const TCHAR* szString);

or like this:

 size_t _tcslen(LPCTSTR szString);

where the type name LPCTSTR reads as follows:
LP - Long Pointer
C - Constant
T - TCHAR
STR - String

Depending on the project settings, LPCTSTR maps to LPCSTR (ANSI) or LPCWSTR (Unicode).
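
The whole family of pointer names can be summarized in one sketch. The typedefs live in a hypothetical `demo` namespace and are switched by a hypothetical `DEMO_UNICODE` symbol, so they only mirror the simplified definitions given in this article, not the exact text of the Windows headers.

```cpp
#include <type_traits>

#define DEMO_UNICODE  // hypothetical stand-in for UNICODE

namespace demo {
typedef char*          LPSTR;    // pointer to an ANSI string
typedef const char*    LPCSTR;   // pointer to a constant ANSI string
typedef wchar_t*       LPWSTR;   // pointer to a wide string
typedef const wchar_t* LPCWSTR;  // pointer to a constant wide string

#ifdef DEMO_UNICODE
typedef LPWSTR  LPTSTR;          // the T variants follow the build mode
typedef LPCWSTR LPCTSTR;
#else
typedef LPSTR   LPTSTR;
typedef LPCSTR  LPCTSTR;
#endif
}  // namespace demo

static_assert(std::is_same<demo::LPCTSTR, const wchar_t*>::value,
              "in the Unicode build LPCTSTR is const wchar_t*");
```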

Note that the strlen, wcslen, or _tcslen functions will return the number of characters in a string, not the number of bytes.
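
The difference between the two counts is easy to show. The helper names `char_count` and `byte_count` are hypothetical; the byte count is expressed through sizeof(wchar_t) because that size differs between compilers.

```cpp
#include <cstddef>
#include <cwchar>

// Number of characters before the terminating null.
inline std::size_t char_count(const wchar_t* text) {
    return std::wcslen(text);
}

// Number of bytes occupied by those characters plus the terminating null.
inline std::size_t byte_count(const wchar_t* text) {
    return (std::wcslen(text) + 1) * sizeof(wchar_t);
}
```

For `L"Saturn"` the first helper gives 6, while the second gives 7 times the size of one wide character.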

The generic string copy function _tcscpy is declared as follows:

 TCHAR* _tcscpy(TCHAR* pTarget, const TCHAR* pSource);

or, in an even more generic manner:

 LPTSTR _tcscpy(LPTSTR pTarget, LPCTSTR pSource);

You can guess what LPTSTR means.

Usage examples.

First, here is an example of code that does not work:

 int main()
 {
     TCHAR name[] = "Saturn";
     int nLen; // Or size_t
     nLen = strlen(name);
 }

In an ANSI build this code compiles successfully, because TCHAR is char and the variable name is an array of char. Calling strlen on name works just fine too.

Now let's compile the same code with UNICODE/_UNICODE defined (in the project settings, select “Use Unicode Character Set”).
The compiler now reports errors like these:

 error C2440: 'initializing' : cannot convert from 'const char [7]' to 'TCHAR []'
 error C2664: 'strlen' : cannot convert parameter 1 from 'TCHAR []' to 'const char *'

Programmers then start "fixing" the error like this:

 TCHAR name[] = (TCHAR*)"Saturn";

This does not pacify the compiler, because a TCHAR* cannot be converted to TCHAR[7]. The same kind of error occurs when ANSI string literals are passed to Unicode functions:

 nLen = wcslen("Saturn");
 // error: cannot convert parameter 1 from 'const char [7]' to 'const wchar_t *'

Unfortunately (or fortunately), this error can be suppressed, incorrectly, with a simple C-style cast:

 nLen = wcslen((const wchar_t*)"Saturn");

You might think you have just become more experienced with pointers. You would be wrong: this code gives an incorrect result, and in most cases it ends in an access violation. Casting like this is like passing a float variable where a structure of 80 bytes is (logically) expected.

The string "Saturn" is a sequence of 7 bytes:

'S' (83)

'a' (97)

't' (116)

'u' (117)

'r' (114)

'n' (110)

'\0' (0)

But when you pass the same sequence of bytes to wcslen, it treats every 2 bytes as one character. The first 2 bytes [97, 83] are therefore read as a single character with the value 24915 (97 << 8 | 83), which is the Unicode character U+6153, a CJK ideograph. The following bytes are read as [117, 116], and so on.
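
The arithmetic can be checked without performing the dangerous cast. The helper `as_utf16_le` is a hypothetical name; it combines two bytes the way a little-endian machine (which is what Windows PCs are) reads a 16-bit value.

```cpp
#include <cstdint>

// The first byte in memory is the low half, the second byte is the high half.
constexpr std::uint16_t as_utf16_le(unsigned char first, unsigned char second) {
    return static_cast<std::uint16_t>((second << 8) | first);
}

// 'S' is 83 and 'a' is 97, so the pair reads as 97 << 8 | 83.
static_assert(as_utf16_le('S', 'a') == 24915, "83 | (97 << 8)");
static_assert(as_utf16_le('S', 'a') == 0x6153, "same value in hexadecimal");
```

The next pair, 't' and 'u', gives 117 << 8 | 116 in the same way, so not a single character of the original text survives.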

Of course, you never passed any Chinese characters; the type cast did that for you!
That is why it is so important to understand that casting will not work. To initialize the array in the first line of the example, you must write:

 TCHAR name[] = _T("Saturn");

which occupies 7 or 14 bytes, depending on the compilation mode.
The wcslen call should look like this:

 wcslen(L"Saturn");

In the sample program above I used strlen, which causes errors in a Unicode build.
Here is a non-working "solution" that uses a C-style cast:

 nLen = strlen((const char*)name);

In a Unicode build, the name variable is 14 bytes long (7 Unicode characters, including the null). Since the string "Saturn" contains only English characters, which can be represented in ASCII, the Unicode character 'S' is stored as the bytes [83, 0]. Every other ASCII character is likewise followed by a zero byte. Notice that the character 'S' is now a 2-byte value equal to 83. The end of the string is represented by 2 bytes with the value 0.

So when you pass such a string to strlen, the first character (that is, the first byte) is correct ('S' in the case of "Saturn"). But the very next byte is taken as the end of the string. As a result, strlen returns the incorrect value 1.
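
To reproduce this result safely, the sketch below spells out the 14 bytes by hand instead of casting a real wide string. The array `kSaturnUtf16Le` and the function `narrow_length_of_wide_bytes` are hypothetical names, and the byte order shown is the little-endian layout used on Windows.

```cpp
#include <cstddef>
#include <cstring>

// "Saturn" as UTF-16 little-endian bytes: every ASCII byte is followed by a
// zero byte, and the terminator is two zero bytes.
static const char kSaturnUtf16Le[] = {
    'S', 0, 'a', 0, 't', 0, 'u', 0, 'r', 0, 'n', 0, 0, 0
};

// strlen stops at the first zero byte, which is the high half of 'S'.
inline std::size_t narrow_length_of_wide_bytes() {
    return std::strlen(kSaturnUtf16Le);
}
```

The array holds 14 bytes, yet strlen sees a one-character string.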

As you know, a Unicode string can contain more than just English characters, in which case the result of strlen is even less predictable.

In short, type casting will not work.
You have to either write the strings in the correct form in the first place, or use the ANSI-to-Unicode conversion functions (and their counterparts for the reverse direction).

Now, I hope, you understand the following declarations:

 BOOL SetCurrentDirectory(LPCTSTR lpPathName);
 DWORD GetCurrentDirectory(DWORD nBufferLength, LPTSTR lpBuffer);

Continuing the topic: you have probably seen functions and methods that take or return a number of characters. GetCurrentDirectory is one of them: you must pass it the number of characters, not the number of bytes.
Example:

 TCHAR sCurrentDir[255];
 // Pass 255, not 255*2
 GetCurrentDirectory(255, sCurrentDir);
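
One way to avoid mixing up the two counts is to let the compiler derive the number of elements from the array itself. The helper `count_of` is a hypothetical name (Visual C++ offers the `_countof` macro for the same purpose), and `fill_with_dots` is an invented stand-in for an API that takes a capacity in characters.

```cpp
#include <cstddef>

// Number of elements in a fixed-size array, whatever the element type.
template <typename T, std::size_t N>
constexpr std::size_t count_of(T (&)[N]) {
    return N;
}

// Stand-in for an API that receives a buffer and its capacity in characters.
// Fills the buffer with dots and returns the characters written (null excluded).
inline std::size_t fill_with_dots(wchar_t* buffer, std::size_t capacity_in_chars) {
    if (capacity_in_chars == 0) {
        return 0;
    }
    std::size_t i = 0;
    for (; i + 1 < capacity_in_chars; ++i) {
        buffer[i] = L'.';
    }
    buffer[i] = L'\0';
    return i;
}
```

Passing `count_of(buffer)` instead of `sizeof(buffer)` keeps the call correct in both build modes.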

On the other hand, if you need to allocate memory for a given number of characters, you must allocate the proper number of bytes. In C++ you can simply use the new operator:

 LPTSTR pBuffer; // TCHAR*
 pBuffer = new TCHAR[128]; // Allocates 128 or 256 bytes, depending on compilation

But if you use memory allocation functions such as malloc, LocalAlloc or GlobalAlloc, you must specify the number of bytes yourself!

 pBuffer = (TCHAR*) malloc(128 * sizeof(TCHAR));

As you know, the return value has to be cast to the proper type. The expression passed to malloc makes sure that the required number of bytes is allocated, giving room for the required number of characters.
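
The same calculation can be wrapped in a small helper so that the multiplication is never forgotten. `bytes_for` and `alloc_chars` are hypothetical names for this sketch, and char16_t is assumed to be 2 bytes wide, as it is on mainstream platforms.

```cpp
#include <cstddef>
#include <cstdlib>

// Bytes needed to hold `count` characters of type CharT.
template <typename CharT>
constexpr std::size_t bytes_for(std::size_t count) {
    return count * sizeof(CharT);
}

// Allocates room for `count` characters; release the result with std::free.
template <typename CharT>
CharT* alloc_chars(std::size_t count) {
    return static_cast<CharT*>(std::malloc(bytes_for<CharT>(count)));
}
```

For 128 characters, `bytes_for<char>` gives 128 and `bytes_for<char16_t>` gives 256, the same two figures as in the comment on the new operator.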

PS

Original article

Happy New Year, everyone!

Source: https://habr.com/ru/post/164193/
