How to make your C++ code cross-platform?

Perhaps someone, after reading the title, will ask: “Why do anything to your code? After all, C++ is a cross-platform language!” In general, that is true... but only as long as the code has no ties to specific capabilities of the compiler and the target platform.

In real life, developers solving a specific task for a specific platform rarely ask themselves: “Does this actually conform to the C++ Standard? What if it is an extension of my compiler?” They write code, run the build, and fix the places the compiler complains about.

As a result, we get an application that is, to some extent, tailored to a specific compiler (and even to a specific version of it!) and to the target OS. Moreover, because the C++ standard library is so sparse, some things are simply impossible to write without using a system-specific API.
So it was with us at Tenzor. We wrote our code in MS Visual Studio 2010. Our products were 32-bit Windows applications. And, of course, the code was riddled with all sorts of ties to Microsoft technologies. One day we decided it was time to explore new horizons: time to teach VLSI to run on Linux and other operating systems, and time to try different hardware (POWER).

In this series of articles, I will tell you how we turned our products into truly cross-platform applications; how we made them run on Linux, macOS, and even iOS and Android; how we ran them on a variety of hardware architectures (x86-64, POWER, ARM, and others); and how we taught them to work on big-endian machines.

The basis of all our products is our own framework, “VLSI Platform” (hereinafter “the Platform”), which is comparable in scale to Qt. The Platform has almost everything a developer needs: from simple functions for fast number-to-string conversion to a powerful fault-tolerant application server.

On top of the Platform, our developers build products (including mobile applications) that solve all sorts of business problems. We wanted to free their code (hereinafter “application code”) from any ties to the target software and hardware platform by hiding all the specifics deep inside our framework.

The VLSI Platform is written in C++, but this does not limit the application programmer's choice of language: in addition to C++, JavaScript, Python, and SQL can be used.

Our company is actively developing its products, so we had to "repair the train at full speed" :)

We had to work in such a way that the other developers would not suffer from our activities and could comfortably keep developing their functionality under Windows with MSVC. This requirement strongly influenced many technical decisions and greatly complicated the work.

To give the reader an idea of the scale of the work, here are some figures:

The code size of our framework is about 2 million lines
The volume of “application” code (code built on the VLSI Platform that solves specific business problems) is difficult to estimate, but it is several times larger than the Platform itself
Over a thousand programmers work in ten development centers

The boring introduction is over. Now let's get down to business and look at the problems we faced.

Using the operating system API

As mentioned above, the C++ standard library is very sparse; it lacks many features that are needed everywhere. For example, C++11 has no networking functionality... That is, as soon as we want to make the simplest HTTP request, we have to... write platform-specific code!

The situation gets even worse if you are not using the latest version of the compiler, as was the case for us: MSVS 2010 has dreadful C++11 support, with a huge part of the core-language and standard-library innovations missing.

But, fortunately, such problems are solved quite easily. There are several ways:

Write our own class with several platform-specific implementations built on the target system's API calls. During the build, #ifdef preprocessor directives select the appropriate implementation.
Use cross-platform libraries: there are many ready-made ones (which, again, use platform-specific implementations internally) that greatly simplify the task. For example, to implement an HTTP client, we took cURL.
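
The first approach can be sketched roughly as follows. This is a minimal illustration, not code from the Platform: the helper name current_process_id is hypothetical, and a free function stands in for a full class. The application code sees one portable signature, while the preprocessor picks the system call.

```cpp
#if defined(_WIN32)
#  include <windows.h>   // GetCurrentProcessId
#else
#  include <unistd.h>    // getpid
#endif

// Hypothetical helper: one portable signature, two platform-specific bodies.
// Only this function knows which OS it is being built for.
unsigned long current_process_id()
{
#if defined(_WIN32)
    return static_cast<unsigned long>( ::GetCurrentProcessId() );
#else
    return static_cast<unsigned long>( ::getpid() );
#endif
}
```

Callers simply write current_process_id() and never mention windows.h or unistd.h themselves.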

Features of compiler implementations

Every program has bugs, and compilers are no exception. Therefore, even code that is 100% compliant with the Standard may fail to compile with some compiler.

Also, almost all compiler developers consider it their duty to add features not provided by the Standard to their creations, thereby provoking programmers to write non-portable code.

What do we get in the end? Code that strictly follows the Standard may fail to compile with some compiler; code that compiles and works with one compiler may fail to build or work incorrectly with another...

Many problems of this class can be listed. Here is one of them:

    throw std::exception( "something went wrong" ); // compiles only with MSVC++

This code builds with MSVC++ because its standard library defines an additional, non-standard constructor:

    exception( const char* msg ) noexcept;

Unfortunately, there are no general methods for solving such problems. The only things that help are experience with the tools you use and a good knowledge of the C++ Standard.
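
For this particular case, though, a portable rewrite is straightforward: std::runtime_error is a standard exception class whose constructor does accept a message. The sketch below is only an illustration; the function name describe_failure and the message text are made up for the example.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Illustrative helper: throws a standard exception with a message and
// returns the text reported by what().
std::string describe_failure()
{
    try
    {
        // std::runtime_error( const char* ) is part of the Standard,
        // unlike the MSVC-only std::exception( const char* ).
        throw std::runtime_error( "something went wrong" );
    }
    catch( const std::exception& e )
    {
        return e.what();
    }
}
```

Code that catches const std::exception& keeps working unchanged, since std::runtime_error derives from std::exception.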

In subsequent articles, I will return to this issue, describe in detail the most common problems and propose methods for solving them.

Undefined behavior

The C++ Standard contains an interesting term: “undefined behavior” (UB). Here is its definition from Wikipedia:

Undefined behavior (in some sources, unpredictable behavior [1] [2]) is a property of some programming languages (most noticeable in C), software libraries, and hardware to produce, in certain edge cases, a result that depends on the implementation of the compiler (library, chip) and on random factors such as memory state or a triggered interrupt. In other words, the specification does not define the behavior of the language (library, chip) in every possible situation, but says: “under condition A, the result of operation B is not defined”. Allowing such a situation in a program is considered an error; even if the program runs successfully when built with some compiler, it will not be cross-platform and may fail on another machine, on another OS, or with different compiler settings.

If your program contains undefined behavior, this does not mean that it will crash or print errors to the console. Such a program may well work as expected. But any change in compiler settings, a switch to another compiler or another compiler version, or even a modification of some unrelated code fragment can change the program's behavior and break everything!

Many cases of undefined behavior behave consistently with one particular compiler, and your carefully tested application will work like a Swiss watch. But as soon as we change the environment (for example, we try to run the program built with another compiler), these bugs begin to show themselves and completely break the program.

The classic example of undefined behavior is writing past the end of an array on the stack. Below is a simplified code snippet from one of our applications with this problem. The bug did not manifest itself under Windows for several years and "fired" only after porting to Linux:

    std::string SomeFunction()
    {
        char hex[9];
        // some code
        hex[9] = 0; // out-of-bounds write: valid indices are 0..8
        return hex;
    }

Apparently, MSVS aligned the buffer on the stack by adding several bytes after it, so when we overwrote memory that was not ours, we hit an empty, unused area. With GCC, the problem showed up in an interesting way: the program crashed far from this code, in another function (apparently, GCC inlined this function, and it began to overwrite the local variables of another function).
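
One way to make this kind of off-by-one impossible is to let the formatting function know the buffer size instead of writing the terminator by hand. The sketch below assumes that the elided "some code" formats a 32-bit value as eight hex digits; that is a guess made for the sake of the example, and the name to_hex is hypothetical.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: the buffer holds 8 hex digits plus the terminating '\0'.
std::string to_hex( std::uint32_t value )
{
    char hex[9];
    // snprintf never writes more than sizeof hex bytes and always
    // null-terminates, so no manual hex[...] = 0 is needed.
    std::snprintf( hex, sizeof hex, "%08" PRIx32, value );
    return hex;
}
```

PRIx32 is used instead of a bare "%x" so that the format matches std::uint32_t on every data model.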

There are subtler, harder-to-spot cases of UB. For example, you can stumble into a very interesting trap when using std::sort:

    std::vector< std::string > v = some_func();
    std::sort( v.begin(), v.end(), []( const std::string& s1, const std::string& s2 )
    {
        if( s1.empty() )
            return true;
        return s1 < s2;
    } );

It would seem, where could UB be here? The whole problem is in the "bad" comparator.
The comparator must return true if s1 has to be placed before s2. Consider what our comparator returns when it receives two empty strings:

s1 = "";
s2 = "";
cmp( s1, s2 ) == true => s1 must come before s2
cmp( s2, s1 ) == true => s2 must come before s1

Thus, there are situations where the comparator contradicts itself, that is, it does not define a strict weak ordering (see en.wikipedia.org/wiki/Weak_ordering#Strict_weak_orderings ). We therefore violated the requirements std::sort places on its arguments and got undefined behavior.

And this is not a contrived example. We caught exactly this problem while porting to Linux. A comparator with a similar error had worked for many years under Windows and... began to crash the application with SIGSEGV under Linux (i686). Interestingly, the bug behaves differently even on different Linux distributions (with different GCC versions on board): in some places the application crashes, in others it hangs, and in others it simply sorts differently than expected.
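
A possible repair of such a comparator is sketched below, assuming the intent was "empty strings first, then lexicographic order". The names empty_first_less and sort_strings are invented for the example. The key property is that comparing two equivalent elements, in particular two empty strings, returns false in both directions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Strict weak ordering: empty strings sort before non-empty ones,
// and cmp( x, x ) is always false.
bool empty_first_less( const std::string& s1, const std::string& s2 )
{
    if( s1.empty() != s2.empty() )
        return s1.empty();   // exactly one is empty: it goes first
    return s1 < s2;          // both empty (false) or both non-empty
}

void sort_strings( std::vector< std::string >& v )
{
    std::sort( v.begin(), v.end(), empty_first_less );
}
```

For std::string specifically, the plain operator< already places the empty string before any non-empty one, so the default std::sort( v.begin(), v.end() ) would likely have been enough here.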

Undefined behavior can often be caught by static analyzers (including those built into the compiler). Therefore, you should always set the maximum warning level in the build settings. And so that a useful warning does not get lost in a crowd of “unused variable”-style warnings, it is worth cleaning up the code once and then turning on the “treat warnings as errors” build option to keep new warnings from slipping in unnoticed.

Data models

The C++ Standard gives no hard guarantees about how data types are represented in memory; it only sets some relationships between them (for example, sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long)) and provides ways to query the characteristics of types.

Different systems may differ significantly in how types are represented. The sizes of the basic types are determined by the data model: the set of type-size relationships adopted in a given development environment. The table below lists the popular data models and shows the corresponding sizes of the main C++ types.

In the overwhelming majority of cases, when choosing a data type, a programmer needs guarantees about its size. But in practice, developers often simply rely on the sizes the basic types have on the system they work on. And again, when switching to a different software or hardware platform, we get surprises: some code stops compiling, and some starts to work differently or stops working altogether.

For example, the hash function below will produce different results on the same data when running on different platforms:

    unsigned long some_hash( const unsigned char* buf, size_t size )
    {
        unsigned long res = 0;
        for( size_t i = 0; i < size; ++i )
            res = res * buf[i] + buf[i] + i;
        return res;
    }

Most of these problems are solved by using types with a guaranteed size (std::int8_t, std::int16_t, and so on, declared in <cstdint>):

    std::uint32_t some_hash( const unsigned char* buf, size_t size )
    {
        std::uint32_t res = 0;
        for( size_t i = 0; i < size; ++i )
            res = res * buf[i] + buf[i] + i;
        return res;
    }
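
To see why the unsigned long version is fragile, the same loop can be run with two accumulator widths. The sketch below is an illustration with a hypothetical template name, hash_as. It uses std::uint32_t and std::uint64_t to stand in for unsigned long on 64-bit Windows (LLP64, 32 bits) and on 64-bit Linux (LP64, 64 bits).

```cpp
#include <cstddef>
#include <cstdint>

// Same algorithm as some_hash, parameterized by the accumulator type
// so that both "platforms" can be compared within one build.
template< typename T >
T hash_as( const unsigned char* buf, std::size_t size )
{
    T res = 0;
    for( std::size_t i = 0; i < size; ++i )
        res = static_cast< T >( res * buf[i] + buf[i] + i );
    return res;
}
```

For five bytes of 0xFF, the 64-bit accumulator already exceeds the 32-bit range, so hash_as< std::uint64_t > and hash_as< std::uint32_t > return different values. The 32-bit result equals the low 32 bits of the 64-bit one, because unsigned arithmetic wraps modulo 2^N.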

Char

I suspect few developers have ever wondered whether char is signed. And of those who did, most opened their favorite development environment, wrote a small test program, and got an answer... that is true only for their system.

In fact, the C++ Standard does not specify the signedness of char. Because of this, there are compiler implementations where char is signed and others where it is unsigned. This is yet another reason your program may refuse to work after being built for another system.

For example, this code works as expected on Linux x86-64, but does not work on Linux POWER (when built with GCC with default options):

    bool is_ascii( char s )
    {
        return s >= 0;
    }

To get rid of the uncertainty, it is enough to add an explicit cast to the desired type:

    bool is_ascii( char s )
    {
        return static_cast<signed char>( s ) >= 0;
    }

Alternatively, in our example, the code can be rewritten entirely with bit operations (note the parentheses: == binds tighter than &):

    bool is_ascii( char s )
    {
        return ( s & 0x80 ) == 0;
    }
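
A self-contained variant of the bitwise check is sketched below, together with a helper that reports how the current compiler treats plain char. The helper name plain_char_is_signed is made up for the example. The comparison is explicitly parenthesized because == has higher precedence than &, and the cast to unsigned char keeps the result independent of the signedness of char.

```cpp
#include <type_traits>

// True for the 7-bit ASCII range regardless of whether char is signed.
bool is_ascii( char s )
{
    return ( static_cast< unsigned char >( s ) & 0x80u ) == 0;
}

// Reports the implementation's choice: typically true on x86-64 Linux,
// false with GCC's defaults on POWER.
bool plain_char_is_signed()
{
    return std::is_signed< char >::value;
}
```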

String representation

The C++ Standard does not regulate some aspects at all, and each compiler resolves them at its own discretion.

For example, there are no guarantees about how string constants will be represented in memory.
The MSVS compiler (on a system with a Russian locale) encodes string constants in Windows-1251, while GCC uses UTF-8 by default.

Because of these differences, the same code produces different results: strlen("Хабр") returns 4 in a program compiled with MSVS and 8 in one compiled with GCC.
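
The difference can be reproduced with any single compiler by spelling out the bytes of the Cyrillic word «Хабр» in both encodings. This is a small sketch; the array and function names are chosen for the example.

```cpp
#include <cstddef>
#include <cstring>

// "Хабр" byte by byte: one byte per letter in Windows-1251,
// two bytes per letter in UTF-8.
const char habr_cp1251[] = "\xD5\xE0\xE1\xF0";
const char habr_utf8[]   = "\xD0\xA5\xD0\xB0\xD0\xB1\xD1\x80";

// strlen counts bytes up to the terminator, not characters.
std::size_t length_in_bytes( const char* s )
{
    return std::strlen( s );
}
```

Here length_in_bytes( habr_cp1251 ) is 4 and length_in_bytes( habr_utf8 ) is 8, although both arrays hold the same four letters.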

The same problems arise with data input and output. For example, our test program saves data to text files and reads it back:

    std::string readstr()
    {
        std::ifstream f( "file.txt" );
        std::string s;
        std::getline( f, s );
        return s;
    }

    void writestr( const std::string& s )
    {
        std::ofstream f( "file.txt" );
        f.write( s.c_str(), s.size() );
    }

Everything works fine as long as the files are written and read by applications compiled in the same environment. But what happens if the file is written by a Windows application and read by an application under Linux?.. We get garbled text (mojibake) :)

What should we do in such cases? The general principle is to choose a single, uniform representation for strings in the program's memory and to encode/decode strings explicitly on input/output. Many developers use UTF-8 in their programs, and this is a very good decision.
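
As an illustration of explicit decoding at the I/O boundary, here is a deliberately limited sketch of a Windows-1251 to UTF-8 converter. It is not code from the Platform, and the function name is hypothetical. It handles only ASCII and the contiguous block 0xC0..0xFF (А..я), which maps to U+0410..U+044F, and replaces every other byte with '?'. Real code would use a complete table or a library such as iconv or ICU.

```cpp
#include <cstddef>
#include <string>

// Limited sketch: ASCII passes through, 0xC0..0xFF becomes a two-byte
// UTF-8 sequence, anything else (e.g. 0xA8, the letter Ё) becomes '?'.
std::string cp1251_to_utf8( const std::string& in )
{
    std::string out;
    for( std::size_t i = 0; i < in.size(); ++i )
    {
        const unsigned char c = static_cast< unsigned char >( in[i] );
        if( c < 0x80 )
        {
            out += static_cast< char >( c );
        }
        else if( c >= 0xC0 )
        {
            const unsigned cp = 0x0410u + ( c - 0xC0u );   // Unicode code point
            out += static_cast< char >( 0xC0u | ( cp >> 6 ) );
            out += static_cast< char >( 0x80u | ( cp & 0x3Fu ) );
        }
        else
        {
            out += '?';
        }
    }
    return out;
}
```

A reader on Linux would apply such a conversion right after reading a file known to be in Windows-1251, so that the rest of the program only ever sees one encoding.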

But, as I mentioned above, we were “repairing the train at full speed”, and we could not break some of the invariants our code relied on (it had been developed on the assumption that strings are encoded in Windows-1251):

characters have a fixed width, so random access to a character by its index is possible
string constants in Russian can be written directly in the code

In UTF-8, characters are represented by a varying number of bytes, so it does not satisfy the first requirement. The second requirement is not met with UTF-8 in, for example, MSVC 2010, where string constants are encoded in Windows-1251. Therefore, we had to abandon UTF-8, and we decided... to abstract away completely from the encoding in which strings are represented, and switched to “wide strings”.

This solution satisfied our requirements almost completely:

On almost all UNIX systems, “wide strings” are encoded in UTF-32, that is, characters have a fixed width that matches the size of a wchar_t element
On Windows, UTF-16 is used. This encoding is somewhat more complicated, since some characters are represented by surrogate pairs. But, fortunately, everything that exists in Windows-1251, which our Windows applications were using, is represented by a single two-byte code unit. Therefore, at the initial stage, we did not support surrogate pairs at all and assumed that under Windows every character fits into one wchar_t element.
In C++, you can write "wide" string constants, for example, L"Hello, habr!". In this case, the compiler itself takes care of converting the string from the encoding of the source file to the encoding used for wchar_t on the target system.
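
The points above can be illustrated with a short sketch; the function names are invented for the example. The Russian word is written with universal character names so that the example does not depend on the encoding of the source file. With a correctly configured source encoding, L"Привет" would do the same.

```cpp
#include <cstddef>
#include <string>

// L"..." yields wchar_t elements; each of these Cyrillic letters occupies
// exactly one element both on Linux (UTF-32) and on Windows (UTF-16).
std::wstring greeting()
{
    return L"\u041F\u0440\u0438\u0432\u0435\u0442";   // "Привет"
}

// Random access by character index, the invariant the code relied on.
wchar_t char_at( const std::wstring& s, std::size_t index )
{
    return s.at( index );
}
```

greeting().size() is 6 on both kinds of platforms, whereas the same word in UTF-8 would occupy 12 bytes in a std::string.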

In addition, using "wide strings" gave us a number of advantages:

The standard C and C++ libraries have many functions and classes for working with “wide strings”, so there is no need to write your own analogs of strlen, strstr, std::string, std::stringstream, etc.
Many third-party libraries support "wide strings" (for example, Boost)
Most of WinAPI can work with “wide strings”

On all the platforms we need, "wide characters" are represented in Unicode. Thanks to this, our applications are no longer limited to Latin and Cyrillic; they support all the languages of the world.

In fact, dealing with encodings was the hardest part of porting our products. There is a lot more to tell about it, but let's leave that for the next articles :)

OS file system features

The Windows file system differs from most UNIX-like file systems in several ways:

It is case-insensitive.
It allows the "\" character to be used as a path delimiter.

What does this lead to? You can name your header file “FiLe.H” and write “#include <myfolder\file.h>” in the code. On Windows, this code will compile, but on Linux you will get an error saying that the file “myfolder\file.h” was not found.

Fortunately, avoiding such problems is very simple: it is enough to adopt file naming rules (for example, name all files in lower case) and stick to them, and always use “/” as the path delimiter (Windows supports it too).

To rule out annoying mistakes completely, we put a simple hook on our git repositories that checks the include directives for compliance with these rules.
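
The rule such a hook enforces can be sketched as a small function. This is not the actual hook, which is not shown in the article; the name include_is_portable and the exact rule (no backslashes, no upper-case letters in the included path) are assumptions based on the conventions described above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if the line is not an #include directive or if the
// included path has no '\' and no upper-case letters.
bool include_is_portable( const std::string& line )
{
    std::size_t pos = line.find_first_not_of( " \t" );
    if( pos == std::string::npos || line[pos] != '#' )
        return true;
    pos = line.find_first_not_of( " \t", pos + 1 );
    if( pos == std::string::npos || line.compare( pos, 7, "include" ) != 0 )
        return true;

    const std::size_t open = line.find_first_of( "<\"", pos + 7 );
    if( open == std::string::npos )
        return true;                      // e.g. #include SOME_MACRO
    const char closer = ( line[open] == '<' ) ? '>' : '"';
    const std::size_t close = line.find( closer, open + 1 );
    if( close == std::string::npos )
        return false;                     // malformed directive

    for( std::size_t i = open + 1; i < close; ++i )
    {
        const unsigned char c = static_cast< unsigned char >( line[i] );
        if( c == '\\' || std::isupper( c ) )
            return false;
    }
    return true;
}
```

A hook would run this check over every added line of a commit and reject the commit if any line fails.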

File system features also affect the application itself. For example:

    std::string root_path = get_some_path();
    std::string path = root_path + '\\' + fname;

If you have code that “glues” paths together with ordinary string concatenation and uses “\” as the delimiter, it will break, because on some operating systems the separator will be treated as part of the file name.

Of course, you can use '/', but on Windows it looks ugly, and in general there is no guarantee that some OS will not use yet another separator.

To solve this problem, we use the boost::filesystem library. It lets you build a path correctly for the current system:

    boost::filesystem::path root_path = get_some_path();
    boost::filesystem::path path = root_path / fname;
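
Since C++17, the standard library offers std::filesystem, which was modeled on boost::filesystem and provides the same operator/. The sketch below uses it instead of Boost only to stay dependency-free; it assumes a C++17 compiler, and the function name join_path is made up for the example.

```cpp
#include <filesystem>
#include <string>

// Joins a directory and a file name with the separator that is
// correct for the system the program is built for.
std::filesystem::path join_path( const std::filesystem::path& root,
                                 const std::string& fname )
{
    return root / fname;
}
```

The result's string() gives the native form for the current system, while generic_string() always uses '/', which is handy for logs and tests.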

Conclusion

Developing cross-platform C++ software is not a trivial task. It is probably impossible to write a program that works on various software and hardware platforms without any extra effort. And it is impossible to develop a large C++ program that builds correctly with any compiler, for any OS, and for any hardware, even though C++ is a cross-platform language. But if you follow the rules I have briefly outlined in this article, you will be able to write code that runs on all the platforms you need. And porting such a program to a new OS or new hardware will not be so difficult.

To sum up, to write cross-platform code you need to:

Know the C++ Standard well and understand what it allows, what is an extension of a particular compiler, and what leads to undefined behavior.
Avoid using the system API directly in your code: encapsulate platform-specific code in dedicated classes or use ready-made cross-platform libraries.
Take possible differences between data models into account and do not rely on properties of the basic types that the C++ Standard does not guarantee. For this, you can use the fixed-size types from the C++ standard library.
Decide on the format of strings in program memory. There are many options: for example, use UTF-8, as many programs do, or switch to "wide" strings and abstract away from the string representation altogether.
Take into account the features of file systems on different operating systems (in #include directives as well as in the logic of the program itself).

Author: Alexey Konovalov

Source: https://habr.com/ru/post/326856/
