📜 ⬆️ ⬇️

Switch for strings in C ++ 11

Unfortunately, the C ++ standard does not allow the use of switch-case operators to string constants. Although in other languages, like C #, there is such a possibility right out of the box. Therefore, of course, many C ++ programmers have tried to write their own version of “switch for strings” - and I am not an exception.
For C ++ 03, the solutions did not differ in beauty and only complicated the code, additionally loading the application in runtime. However, with the advent of C ++ 11, it finally became possible to implement such code:

std::string month; std::string days; std::cout << "Enter month name: "; std::cin >> month; SWITCH (month) { CASE("february"): days = "28 or 29"; break; CASE("april"): CASE("june"): CASE("september"): CASE("november"): days = "30"; break; CASE("january"): CASE("march"): CASE("may"): CASE("july"): CASE("august"): CASE("october"): CASE("december"): days = "31"; break; DEFAULT: days = "?"; break; } std::cout << month << " has " << days << " days." << std::endl; 

The implementation of this design is very simple. It is based on constexpr-functions from C ++ 11, due to which almost all calculations are performed at the compilation stage. If someone is interested in its details, welcome under the cut - the benefit of Habré says nothing about the “switch for strings”.


What do we want?


First of all, it should be a full-fledged switch , and not its “emulation” by hiding if- operators and functions for comparing strings inside macros, since comparing strings in runtime is an expensive operation, and it is too wasteful for each row from CASE. Therefore, this solution does not suit us - besides, incomprehensible macros like END_STRING_SWITCH inevitably appear in it.
')
In addition, it is very desirable to use the compiler to the maximum. For example, what will happen to the “normal” switch-case in the case when the arguments of two cases turn out to be the same? That's right, the compiler will immediately curse us: “duplicate case value” , and stop compiling. And in the example of the above link, of course, no compiler will be able to notice this error.

The final syntax is also important, simple and without unnecessary constructions. That is why the well-known version of " std :: map with a string key" does not suit us either: firstly, the case arguments in it do not look vivid - and secondly, it requires mandatory initialization of the used std :: map in runtime. The function of this initialization can be anywhere, constantly looking into it is too tiring.


We start to calculate the hash


The last option remains: calculate the hash of the string in the switch , and compare it with the hash of each string in the case . That is, it all comes down to comparing two integers for which the switch-case works great. However, the C ++ standard says that the argument of each case should be known when compiling - therefore, the function “calculate hash from a string” should work in compile-time. In C ++ 03, it can only be implemented using templates, like the CRC calculation in this article . But in C ++ 11, fortunately, there appeared clearer constexpr functions, the values ​​of which can also be computed by the compiler.

So, we need to write a constexpr-function, which would operate with numeric codes of char- symbols. As you know, the body of such a function is " return <an expression> known in compile-time> ". Let us try to implement its most “degenerate” version, namely, the function of calculating the length of the const char * string. But here the first difficulties await us:

  constexpr unsigned char str_len(const char* const str) { return *str ? (1 + str_len(str + 1)) : 0; } std::cout << (int) str_len("qwerty") << " " << (int) str_len("") << std::endl; //  

The compiler does not swear, the function is correct. However, for some reason she didn’t bring me “6 6”, but “6 12”. Why so? And the thing is that I typed this source code under Windows in the “Linux” encoding of UTF-8, and not in the standard Win-1251 - and therefore each “Cyrillic” character was perceived as two. Now, if you change the encoding to the standard one, then "6 6" is really displayed. Well, it turns out, our idea failed? After all, this is not the case when, with different encodings, different hashes are obtained ...


We check the contents of the line


But why do we need Cyrillic or Asian characters? In the overwhelming majority of cases, for source codes, only English letters and standard punctuation marks are sufficient - that is, characters that fit in the range from 0 to 127 in the ASCII table . And their char- codes will not change when changing the encoding - and therefore the hash from the string composed only of them will always be the same. But what if the programmer accidentally enters one of these characters? The following compile-time function comes to the rescue:

  constexpr bool str_is_correct(const char* const str) { return (static_cast<signed char>(*str) > 0) ? str_is_correct(str + 1) : (*str ? false : true); } 

It checks whether the string known at the compilation stage contains only characters from the range 0-127, and returns false if the "forbidden" character is found. Why do I need a forced caste to signed char ? The fact is that in the C ++ standard it is not defined what exactly the type is char - it can be either signed or unsigned. But its sizeof will always be equal to 1, which is why we shift to the right by one. Thus, the CASE macro we need will look like:

  #define CASE(str) static_assert(str_is_correct(str), "CASE string contains wrong characters");\ case str_hash(...) 

Here one more feature of C ++ 11 is used - assert when compiling. That is, if the string argument of the CASE macro contains at least one “forbidden” character, the compilation will stop with an understandable error. Otherwise, the hash will be calculated, the value of which will be substituted in the case . This solution will save us from problems with the encoding. It remains only to write the function str_hash () itself, which will calculate the desired hash.


Go back to the hash calculation


You can choose a hash function in different ways, and the most important question here is the possibility of collisions. If the hashes of the two different strings match, then the program can jump to the false case with the switch . And the satellite will fall into the ocean ... Therefore, we will use the hash function, which has no collisions at all. Since it is already established that all characters of the string are in the range of 0-127, the function will be: image . Its implementation is as follows:

  typedef unsigned char uchar; typedef unsigned long long ullong; constexpr ullong str_hash(const char* const str, const uchar current_len) { return *str ? (raise_128_to(current_len - 1) * static_cast<uchar>(*str) + str_hash(str + 1, current_len - 1)) : 0; } 

Here raise_128_to () is the compile-time function of raising 128 to the power, and current_len is the length of the current line. Of course, the length can be calculated at each recursion step, but this will only slow down the compilation - it is better to count it before the first run of str_hash (), and then always substitute it as an additional argument. For what is the maximum length of the string, this function will not have collisions? Obviously, only when the value obtained by it always fits in the range of the unsigned long long type (also finally entered in C ++ 11), that is, if it does not exceed 2 64 -1.

It can be calculated that this maximum length will be equal to 9 (we denote this number as MAX_LEN). But the maximum possible hash value will be exactly 2 63 for a string, all characters of which have code 127 (and are unreadable in ASCII, so under CASE we will not chase it anyway). But what if under CASE will be a string of 10 characters? If we really don’t want the occurrence of collisions, then we need to prohibit this possibility as well - that is, extend the static_assert we already use:

  #define CASE(str) static_assert(str_is_correct(str) && (str_len(str) <= MAX_LEN),\ "CASE string contains wrong characters, or its length is greater than 9");\ case str_hash(str, str_len(str)) 


Make final touches


Everything with the CASE macro is over. It will either give us a compilation error, or calculate a unique hash value. Here for calculating the hash in the SWITCH macro, we will have to make a separate function, since it will work in runtime. And if its argument string has a length of more than 9 characters, then we agree to return 2 64 -1 (we denote this number as N_HASH). So:

  #define SWITCH(str) switch(str_hash_for_switch(str)) const ullong N_HASH = static_cast<ullong>(-1); //    std::string::npos inline ullong str_hash_for_switch(const char* const str) { return (str_is_correct(str) && (str_len(str) <= MAX_LEN)) ? str_hash(str, str_len(str)) : N_HASH; } 

Actually, that's all. In runtime, the hash for the string in SWITCH is calculated, and if one row in CASE has the same hash, the execution will go to it. If any string from CASE contains “forbidden” characters, or its length is more than 9 characters, we will get a reliable compilation error. There is almost no load in runtime (except for a single hash calculation for SWITCH), the readability of the code does not suffer. It remains only to overload the str_hash_for_switch () function for std :: string , and conclude everything inside the namespace.

The final h-file of source codes lies on Gitkhab . It is suitable for any compiler with C ++ 11 support. To use “switch for strings”, simply turn on str_switch.h wherever you want, and that’s all - the SWITCH, CASE and DEFAULT macros are in front of you. Do not forget about the limit on the length of the string in CASE (9 characters).
In general, I hope that this implementation will be useful to somebody;)

PS Upd: a couple of inaccuracies were found in the code - so now it has been updated, along with the article. The comments also suggested using another function to calculate the hash - of course, everyone can write their own implementation.

Source: https://habr.com/ru/post/166201/


All Articles