Making Python 3 friends with MS Visual C++: building a bridge in Boost.Python with automatic transcoding

Good {daytime}, everyone!

Today I want to talk about a fundamental problem: transcoding text between a project compiled with MS Visual C++ on Windows and Python, which, thanks to the Boost.Python library, is the most pleasant scripting language for C++.

Perhaps you want good scripting in the latest Python 3.x for your C++ application on Windows, or you want to move the most performance-critical part of your Python application into a module written in C++. In either case, if you know both languages reasonably well, this article is for you.

I will not bore you with a long discussion of text transcoding in general. We all know the problem is not new. It is solved differently everywhere, and in most cases it is not solved at all, that is, it is shifted onto the programmer's shoulders.
Starting with version 3.0, Python treats strings purely as text. However the text is stored internally, it is Unicode, and the concept of a string is permanently separated from its encoding. That is, the only way to find out which byte values correspond to the characters of a string is to encode it into an array of bytes, specifying the encoding.

    "Привет!".encode('cp1251')

The example above shows that the string “Привет!” (“Hello!”) stays exactly as you typed it. Whether you look at it in Russia, the United States, or China, on Windows, Linux, or MacOS, it remains the string “Привет!”. Encoding it into a byte array with the str.encode(encoding) string method always yields the same byte values, regardless of where in the world we are and which platform we use. And that's great!
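A minimal Python sketch of the point above, assuming the sample string “Привет!”: the resulting bytes depend only on the text and the chosen encoding, never on the machine. The variable names are illustrative.

```python
# The same text encoded with two different encodings.
text = "Привет!"

cp1251_bytes = text.encode("cp1251")
utf8_bytes = text.encode("utf-8")

print(cp1251_bytes)
print(utf8_bytes)

# Decoding with the matching encoding restores the original text.
assert cp1251_bytes.decode("cp1251") == text
assert utf8_bytes.decode("utf-8") == text
```

Decoding with the wrong encoding is exactly the mistake the rest of this article deals with.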
But back to Earth. There is this operating system called Windows...

The whole problem lies in the wonderful MS Visual Studio development environment. What makes it most remarkable is that all narrow strings in C++ are guaranteed to be in the encoding of the Windows code page. For Russia, that means all strings are always in 'cp1251'. That would be fine, except that this encoding is unsuitable for output to a web page, saving to XML, writing to an international database, and so on. The wide strings of the form L"Hello" that Microsoft proposes are a little more acceptable, but we know how "wonderful" it is to work with such strings in C++. Besides, we will assume that the project came to us with a pile of strings already in cp1251: gigabytes of code that work with std::string and char*, and work with them well, quickly and efficiently.

If you come to C++ from Python, just remember that Python strings convert cleanly to char* using Python's own memory: a Python 3.x string can provide a UTF-8 representation that is stored with the string object and looked after by the GC and reference counting. So, once again: there is no need for Microsoft's UCS-2 as "Unicode"; use regular strings. Also remember that your company's local Russian database will not thank you for nearly doubling its data size when switching from Windows-1251 to UTF-8, since it is probably full of Cyrillic text.
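To put a rough number on the size claim, here is a small Python sketch. The sample record is made up for illustration; real data containing more Latin letters, digits, and spaces will grow less.

```python
# Compare the storage size of the same Cyrillic text in two encodings.
record = "Иванов Иван Иванович"   # hypothetical database value

size_cp1251 = len(record.encode("cp1251"))  # one byte per character
size_utf8 = len(record.encode("utf-8"))     # two bytes per Cyrillic letter, one per space

ratio = size_utf8 / size_cp1251
print(size_cp1251, size_utf8, round(ratio, 2))
```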
In short, that is the problem.

Now for the solution.

You probably already have the latest Python 3.x (Python 3.3 at the time of writing); if not, install the latest release from here: www.python.org/download/releases
You probably also have MS Visual Studio (VS2012 is the latest at the time of writing, but everything below also holds for the previous version, VS2010).
To bind your C++ classes to Python you need the Boost.Python library. It is part of the almost-standard Boost library: www.boost.org (the latest version at the time of writing is 1.52, but the approach has been checked on versions back to 1.44).
Unfortunately, unlike most of Boost, Boost.Python has to be built. If you have not yet built it along with the other libraries, you can build Boost.Python alone with the following Boost.Build command (bjam in older versions):
b2 --with-python --build-type=complete
If you installed the x64 build of Python 3.x, you must also specify address-model=64.
See the Boost.Build documentation for details.
As a result, {BoostDir}\stage\lib\ should contain a bunch of boost_python* libraries. We are about to need them!
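If you are not sure which build of Python is installed, the interpreter itself can tell you. A minimal sketch that derives the value from the pointer size (the helper name address_model is made up here):

```python
import struct

def address_model() -> int:
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8

# Prints the option in the form that b2 expects on its command line.
print("address-model={}".format(address_model()))
```

Run it with the same interpreter that will load your extension module; if it reports 64, add address-model=64 to the b2 command.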

So let's reproduce the problem. We write a simple class:

    class MY_EXPORT Search
    {
    public:
        static string That( const string& name );
    };

Its only method is implemented like this:

    string Search::That( const string& name )
    {
        if( name == "Это я!" )
            return "Я";
        else
            throw runtime_error( "   !" );
    }

In reality everything is much more complicated: you probably have a database record with Cyrillic field names and Cyrillic values, all encoded in Windows-1251. But this test example is enough for debugging. It covers string conversion in both directions between C++ and Python, and even passing an exception to Python.

We wrap our small library using Boost.Python:

    BOOST_PYTHON_MODULE( my )
    {
        class_<Search>( "Search" )
            .def( "That", &Search::That )
            .staticmethod( "That" )
        ;
    }

Do not forget to add the dependencies on Boost and on the source library in the project settings!
Rename the resulting library to my.pyd (yes, just change the extension).

Now we try it from Python. If you do not have an IDE such as Eclipse + PyDev at hand, you can work straight from the console; just import and call it in two lines:

    import my
    my.Search.That( "Это я!" )

Do not forget that this is still a .dll, and it probably needs the .dll of the source library containing the Search class. In addition, the new wrapper library needs the Boost.Python .dll of the matching build from {BoostDir}\stage\lib\; for example, for MS VS2012 and Boost 1.52 in a Debug build (Multi-threaded DLL) it is boost_python-vc110-mt-gd-1_52.dll.
If it is unclear what your .dll is missing, inspect its dependencies with Dependency Walker: www.dependencywalker.com (just open your wrapper library's .dll in depends.exe).
So, you have managed to import the my library and execute my.Search.That( "Это я!" )

If everything went well, you will see the exception that came from C++, with empty text. That is, not only did we miss the intended branch of the if, but the exception text was also not transcoded the way we sent it!

If you attach to the Python process via "Attach to Process" in MS Visual Studio, you will see that in Search::That( const string& name ) the name argument arrives in UTF-8. Boost.Python does not know which encoding to deliver the string in, so it uses UTF-8 by default.
Our code in Visual Studio, of course, assumes Windows-1251 throughout, so it cannot understand that something like “Р­С‚Рѕ СЏ!” is actually “Это я!”. It is a conversation between the blind and the deaf. For the same reason, the text of the exception coming from C++ is not visible in Python.
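The mismatch is easy to reproduce in Python alone. The sketch below assumes the C++ literal is “Это я!” compiled under code page 1251; it shows why the comparison in Search::That fails and what the cp1251-oriented code sees instead.

```python
sent = "Это я!"

wire = sent.encode("utf-8")        # bytes that arrive in C++ by default
expected = sent.encode("cp1251")   # bytes of the literal in cp1251 C++ code

# The byte sequences differ, so name == "..." on the C++ side is false.
print(wire == expected)

# How code that assumes cp1251 reads the UTF-8 bytes.
seen_by_cpp = wire.decode("cp1251")
print(repr(seen_by_cpp))
```

The garbled result contains an invisible soft hyphen, which is why printing it with repr is more informative than printing it directly.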

Well, let's fix it.

The first thing that comes to mind is to inherit from the original class, or wrap it in another one, that can transcode.
Sure, and now look at all the other classes, lonely and shuffling their feet while waiting for their turn. Are you ready to spend half your life on this? Even if you are, the very first performance measurements will show how wrong you were to wrap things in derived classes. And in the end you will have hellish problems when you try to get the C++ objects back out of the wrapped classes. You will have them, believe me! We are building a bridge to be crossed in both directions, so the class wrappers should refer directly to the methods and properties of the target class. See extract<T&>(obj) from boost::python on the C++ side.

Let us go through everything Boost.Python does when a string travels between C++ and Python. We find several notable places that use the _PyUnicode_AsString and PyUnicode_FromString functions. Knowing a little of Python's native C API (or, if we do not, reading the documentation), we realize that this is the root of all evil. Boost.Python distinguishes Python 2 from Python 3 perfectly well, but it has no way of knowing that a Unicode string must be converted to a string in the code page of the file system. The Python API, however, provides alternative functions for exactly this, which we are meant to call ourselves:

PyUnicode_DecodeFSDefault - decodes a string from the file system encoding (in our case, Windows-1251) and returns a ready-made string object. It is a drop-in replacement for PyUnicode_FromString in the files str.cpp and converter\builtin_converters.cpp under {BoostDir}\libs\python\src\.

PyUnicode_DecodeFSDefaultAndSize - the same, but takes the length of the string. It replaces the analogous PyUnicode_FromStringAndSize in the same files.

PyUnicode_EncodeFSDefault - the reverse: it takes a Python string object, encodes it, and returns the result as a byte array (a PyBytes object), from which an ordinary C string can then be obtained with PyBytes_AsString. It is needed for the reverse conversion in place of PyUnicode_AsUTF8String, and the pair
PyBytes_AsString( PyUnicode_EncodeFSDefault( obj ) ) replaces the _PyUnicode_AsString( obj ) macro, which does essentially the same thing but without transcoding.
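The effect of these replacements can be imitated in pure Python. This is only an illustration, not the patch itself. The helper names to_cpp and from_cpp are made up, and the encoding is pinned to cp1251 as an assumption. The C API functions use whatever sys.getfilesystemencoding() reports instead, and on Windows with Python 3.6 and later that is utf-8 by default, so check it on your interpreter.

```python
import sys

# Assumption: stands in for the ANSI code page of a Russian Windows installation.
FS_ENCODING = "cp1251"

def to_cpp(text: str) -> bytes:
    """Roughly what PyBytes_AsString(PyUnicode_EncodeFSDefault(obj)) hands to C++."""
    return text.encode(FS_ENCODING)

def from_cpp(raw: bytes) -> str:
    """Roughly what PyUnicode_DecodeFSDefault(raw) returns to Python."""
    return raw.decode(FS_ENCODING)

print(sys.getfilesystemencoding())  # the encoding the real C API calls would use
print(to_cpp("Это я!"))             # bytes that match a cp1251 C++ literal
print(from_cpp(b"\xdf"))            # a cp1251 byte from C++ becomes a proper str
```

With the conversion applied in both directions, the bytes reaching C++ match the cp1251 literals, and the bytes coming back become ordinary Python strings.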

So, we are armed to the teeth and know the enemy by sight; all that remains is to find and neutralize him!

We need the files that use PyUnicode_* among the sources under {BoostDir}\libs\python\src\ and the header files under {BoostDir}\boost\python\. Besides, I will reveal the secret right away: we will also need to fix exceptions in the errors.cpp file.

In short, the list is as follows:
builtin_converters.cpp - fix string conversions from Python to C++ and back
builtin_converters.hpp - fix the conversion macro in the header file
str.cpp - fix the C++ wrapper around the Python str class (the usual Python string in C++)
errors.cpp - fix the passing of exception text from C++ to Python

The changes are few and localized; patches and change reports are in the archive attached to the article. As a rule, each change is no more than one line of code, more often just a single call: 13 changes in total across 4 files. You are not superstitious, are you?

After all the edits, we rebuild only Boost.Python with the command already mentioned above:
b2 --with-python --build-type=complete
(Add address-model=64 if you are building for x64, that is, if both your project and the Python 3.x installed on your machine are built for the 64-bit architecture.)

Once Boost.Python is built, rebuild your project against the updated library: not only the .lib and .dll have changed, but also one header file.
Do not forget to replace the old .dll files with the freshly built ones. You surely will not forget to copy them, right?

The moment of truth!

    import my
    res = my.Search.That( 'Это я!' )
    print( res )

The very same code now returns what was expected: the string 'Я'.
Fully Cyrillic, fully Unicode, since Python 3 treats this object as a string!
Now let's check how our exception arrives:

    import my
    res = my.Search.That( 'Это я!' )
    print( res )
    try:
        my.Search.That( ' - !' )
    except RuntimeError as e:
        print( e )

Our exception arrives perfectly, with the expected text, as a RuntimeError, a standard Python exception.
As a bonus, boost::python::str objects created on the C++ side are now converted to Unicode immediately, which helps a lot when we want to access, from C++, an attribute of a Python object that has a Cyrillic name:

    object my = import( "my" );
    object func = my.attr( str("") );
    int res = extract<int>( func( x * x ) );

Code like this no longer causes problems in MS Visual C++. Everything hooks up, gets called, and returns what it should.
And since we are talking about calling Python from C++ code, it is worth mentioning how to catch exceptions coming from there.
All Python exceptions are caught at the C++ level as error_already_set&, again from boost::python. Extracting the text, type, and stack of an exception is not difficult and is described in detail here: wiki.python.org/moin/boost.python/EmbeddingPython, in the section Extracting Python Exceptions. In the overwhelming majority of cases you need nothing more than the exception text, unless of course you have invented your own exception logic. In that case you had better write your own exception translator, but that is another story...
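Roughly speaking, such extraction comes down to fetching the pending error and formatting it with Python's own traceback module. As a sketch of what the three pieces look like, here is the same extraction done on the Python side; the function name describe_exception and the sample message are made up for this illustration.

```python
import traceback

def describe_exception(exc: BaseException) -> dict:
    """Split an exception into its type name, its text, and its formatted stack."""
    return {
        "type": type(exc).__name__,
        "text": str(exc),
        "stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }

try:
    raise RuntimeError("текст исключения")
except RuntimeError as e:
    info = describe_exception(e)

print(info["type"], info["text"])
print(info["stack"])
```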

Summary

We made native MS Visual C++ code friends with ordinary Python code using a small patch to Boost.Python, without changing our own code at all, simply by replacing calls to some Python API functions with others that perform the additional transcoding. Since everything goes through the Python API itself, Python takes care of the memory allocated for the objects: no std::string and other horrors of accessing the heap through the wonderful mutexes that Microsoft so diligently built into the new implementation of its standard library. No! None of that! Python does everything for us; we only had to help it a little.
Mere mortals can keep writing code in Visual Studio without thinking about encodings, perhaps without even knowing they exist. A narrow specialist in, say, the transport layer (protocols, data packets, and so on) does not really need to know about them.
The particularly inquisitive can measure the transcoding overhead; it certainly exists. By my measurements it is so small that rewriting the code of one very slow web page generator from C++ into a single join + format in Python sped it up by almost 10%, and that includes the transcoding introduced by the edits above. You can judge from this how insignificant the overhead is, given that the C++ code was merely assembling a fairly large string (even with a preliminary reserve).
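For those who want to measure it themselves, here is a rough sketch using the standard timeit module. It times only the pure transcoding step in Python, with cp1251 standing in for the code page; the sample text and the repetition count are arbitrary, and it does not measure Boost.Python itself.

```python
import timeit

sample = "Привет, мир! " * 1000   # 13,000 characters of mostly Cyrillic text
raw = sample.encode("cp1251")     # bytes as they would come from C++

def round_trip() -> str:
    """Decode bytes from C++, encode them back, and decode again."""
    return raw.decode("cp1251").encode("cp1251").decode("cp1251")

number = 1000
seconds = timeit.timeit(round_trip, number=number)
per_call_us = seconds / number * 1e6
print("{:.1f} microseconds per round trip".format(per_call_us))
```

Compare the reported figure with the time your page generation or query handling takes to see how much of it transcoding could account for.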
As for stability, at least one shell built on these changes has been running safely on production sites for at least six months (although with a Boost version much older than the current one). To this day everything is transcoded reliably and has caused no complaints.

The promised archive with changes

Here are the patches and change reports for the Boost.Python library files:
www.2shared.com/file/NFvkxMzL/habr_py3_cxx.html

As a bonus, there is also a small archive with a test project (built for x64):
www.2shared.com/file/FRboyHQv/pywrap.html

Useful links

Python 3 documentation, the C API section on converting to and from the file system code page:
docs.python.org/3/c-api/unicode.html?highlight=pyunicode#file-system-encoding

Boost.Python documentation:
www.boost.org/doc/libs/1_52_0/libs/python/doc

Source: https://habr.com/ru/post/161931/
