How it all began
I once took part in developing a small scientific computing project written in Python. Python was initially chosen as a convenient, flexible language for experiments, visualization, rapid prototyping and algorithm development, but it later became the project's main development language. The project was not large, but it was technically dense. To provide the required functionality it made heavy use of algorithms from graph theory, mathematical optimization, linear algebra and statistics. Decorators, metaclasses and introspection tools were also used. Development relied on third-party math packages and libraries such as numpy and scipy, along with many others.
Over time it became clear that rewriting the project in a compiled language would cost too much time and too many resources. Execution speed and memory consumption were not critical here and were quite acceptable. It was therefore decided to leave everything as it was and keep developing and maintaining the project in Python. Besides, most of the documentation had already been written using Sphinx.
The project was a library whose functions were used by one of the extension modules of a large software package. That package was written in C++, was a commercial product protected by a hardware key, and was delivered to customers without source code.
This immediately raised a new problem: how do we protect the source code of our Python library? In other circumstances perhaps nobody would have bothered, and I certainly would not have, but the library contained some know-how, and the project leads did not want it to reach competitors. Since I was one of the developers, I had to deal with the problem. Below I will describe the main idea, what came of it, and how we managed to hide the Python sources from prying eyes.
What people suggest
As most developers probably know, Python is an interpreted, dynamic language with rich introspection capabilities. The compiled module files *.pyc and *.pyo (bytecode) are easily decompiled, so they cannot be distributed as they are (if we really do not want to reveal the source code).
Like anyone in my place, I first decided to find out what people do in such cases. The first search queries showed that people do not know what to do and ask about it on Stack Overflow and elsewhere, for example in this question on Stack Overflow. After searching, I concluded that the same few debatable approaches are suggested everywhere:
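To illustrate why shipping bare bytecode protects very little, here is a minimal sketch. It assumes CPython 3.7 or later, where a .pyc file starts with a 16-byte header followed by a marshalled code object. The module name `secret_mod` and its contents are hypothetical.

```python
import dis
import marshal
import os
import py_compile
import tempfile

# Hypothetical module containing a "secret" constant and function.
SOURCE = b"SECRET_FACTOR = 104729\n\ndef scale(x):\n    return x * SECRET_FACTOR\n"

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "secret_mod.py")
    with open(src_path, "wb") as f:
        f.write(SOURCE)
    # Compile to bytecode, as if only the .pyc were being shipped.
    pyc_path = py_compile.compile(src_path, cfile=os.path.join(tmp, "secret_mod.pyc"))
    with open(pyc_path, "rb") as f:
        data = f.read()

# Assumption: 16-byte header (magic, flags, mtime/hash, size) on CPython 3.7+.
code = marshal.loads(data[16:])

# Names and constants are stored in the clear inside the code object.
recovered_names = code.co_names
recovered_consts = code.co_consts

# The full instruction stream can be listed with the standard disassembler.
dis.dis(code)
```

No third-party tool is needed for this much. Dedicated decompilers go further and reconstruct readable source from the same data.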
- Give up and stop worrying: whoever really needs the code will get it anyway;
- Rewrite the project in a compiled language;
- Obfuscate the source code, for example with one of the available obfuscation tools;
- Convert all Python modules into extension modules (*.pyd) using Cython or Nuitka (as warsoul, the author of this article, did);
- Replace the opcodes in the Python interpreter's source code and distribute your own build, as suggested by hodik.
For various reasons I rejected all of these approaches as unsuitable. Take obfuscation of Python code: what kind of obfuscation is possible when the language's syntax is built on indentation and the language itself is permeated with clever introspection? Converting all Python modules into binary extension modules was not feasible either. The project, as I said, was technically complex and used many third-party packages, and it consisted of a large number of modules in a multi-level package hierarchy. Converting all of that to *.pyd and then chasing bugs that appear out of nowhere would have been tedious. I did not want to get into replacing opcodes, because I would have had to distribute and maintain my own build of the Python interpreter and also compile the Python modules of every third-party library we used.
At some point it seemed to me that protecting the Python source code was a hopeless idea, and that I should drop it, convince management that nothing would come of it, and go do something useful. So we ship the *.pyc files, fine; who is going to dig into them? I failed to convince anyone: nobody wanted to rewrite the library in C++, and the project had to be delivered. In the end we did manage to do something. If you are still interested, read on.
What we did
What best protects information on digital media from outsiders? Encryption, I think. Armed with this fundamental idea, I decided that the source code had to be encrypted, no less. To an outside observer showing excessive interest, it should all look like a pile of incomprehensible files with incomprehensible contents. This is still obfuscation, but a more advanced kind than renaming variables and inserting blank lines.
My line of thought was as follows:
- Encrypt all the sources of our Python library in some way; we can even shuffle them and rename the module and package files;
- Write the glue code that lets the Python interpreter load and import modules from the encrypted text files (decryption, restoring the package structure and file names, importing, and so on);
- "Hide" all of this in a binary extension module (*.pyd) so that nobody can guess what is going on.
The basic idea, I think, is clear: this is a more advanced form of obfuscation. How do we do it? After some googling I concluded that it is quite feasible and even fairly simple. Encrypting the source code is the straightforward part. You can encrypt and/or obfuscate the files in any number of ways, as long as the result is an unreadable mess and (in the case of obfuscation) it can be restored to its original form in a way outsiders do not know. For the example here, I will use the base64 Python module to "encrypt". In non-critical cases, you can use the wonderful obfuscate package.
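The "encryption" step with base64 can be sketched as follows. This is an illustrative sketch under assumptions, not the project's actual tooling: the helper name `encode_tree` and the directory names are hypothetical, and the `.b64` suffix follows the extension the article adopts for its example.

```python
import base64
import os
import tempfile

def encode_tree(src_root, dst_root):
    """Mirror a package tree, writing each .py file as base64 text with a .b64 suffix."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(out_dir, exist_ok=True)
        for name in filenames:
            if not name.endswith(".py"):
                continue  # only Python sources are encoded
            with open(os.path.join(dirpath, name), "rb") as f:
                encoded = base64.b64encode(f.read())
            out_path = os.path.join(out_dir, name[:-3] + ".b64")
            with open(out_path, "wb") as f:
                f.write(encoded)
            written.append(out_path)
    return written

# Small demonstration on a throwaway package (hypothetical name "mylib").
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = os.path.join(tmp, "src", "mylib")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "wb") as f:
        f.write(b"VALUE = 1\n")
    out_files = encode_tree(os.path.join(tmp, "src"), os.path.join(tmp, "dist"))
    with open(out_files[0], "rb") as f:
        round_trip = base64.b64decode(f.read())
```

Base64 is of course an encoding, not encryption. It only stands in for whatever reversible transformation a real project would use.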
The Python Importer Protocol
How do we implement importing modules from encrypted files? Fortunately, Python has an import hook system based on the Importer Protocol (PEP 302), and that is what we will use. Imports are intercepted through the sys.meta_path list, which holds finder/loader objects that implement the Importer Protocol. This list is always consulted before the paths in sys.path are checked.
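The lookup order can be observed with a finder that only records what it is asked for. This sketch assumes Python 3, where a meta path finder implements `find_spec` (the successor to the PEP 302 `find_module` hook). The class name `LoggingFinder` is hypothetical.

```python
import importlib.abc
import sys

class LoggingFinder(importlib.abc.MetaPathFinder):
    """Records every import request it sees and handles none of them."""

    def __init__(self):
        self.seen = []

    def find_spec(self, fullname, path, target=None):
        self.seen.append(fullname)
        return None  # decline, so the remaining finders (and sys.path) are tried

finder = LoggingFinder()
sys.meta_path.insert(0, finder)  # first position: asked before the standard finders
try:
    sys.modules.pop("colorsys", None)  # force a fresh import of a small stdlib module
    import colorsys
finally:
    sys.meta_path.remove(finder)

was_asked = "colorsys" in finder.seen
```

Returning `None` is how a finder says "not mine". That is the mechanism that lets a custom importer handle only its own modules and leave everything else to the standard machinery.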
For a minimal implementation of the import protocol, you need to implement two methods: find_module and load_module. The find_module method is responsible for finding a specific module or package (after all, we only need to intercept imports of our own modules and pass the rest on to the standard mechanism), and the load_module method, accordingly, loads a specific module only if it was "found" by find_module.
So we seem to have reached the heart of the matter, and a simple example is in order: a minimal class that implements the Importer Protocol and suits our purposes. It imports base64-encoded modules from an ordinary package structure (for simplicity, we have only "encrypted" the contents of the files and have not changed their names or the package structure). The files of our modules proudly carry the extension ".b64".
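What follows is a minimal sketch of such a class, written under assumptions rather than taken from the project. The text describes `find_module`, `load_module` and the `imp` module from Python 2.7. This sketch assumes Python 3 and uses their successors in `importlib`, `find_spec` and `exec_module`. The package name `demo_b64_pkg` and its contents are hypothetical.

```python
import base64
import importlib.abc
import importlib.util
import os
import sys
import tempfile

class Base64Importer(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Imports base64-encoded modules (*.b64) from an ordinary package tree."""

    EXT = ".b64"

    def __init__(self, root_pkg_path):
        # Collect information about our modules: fullname -> (file path, is_package).
        self._modules = {}
        root = os.path.abspath(root_pkg_path)
        base = os.path.dirname(root)
        for dirpath, _dirs, files in os.walk(root):
            pkg = os.path.relpath(dirpath, base).replace(os.sep, ".")
            for name in files:
                if not name.endswith(self.EXT):
                    continue
                stem = name[:-len(self.EXT)]
                path = os.path.join(dirpath, name)
                if stem == "__init__":
                    self._modules[pkg] = (path, True)
                else:
                    self._modules[pkg + "." + stem] = (path, False)

    def find_spec(self, fullname, path=None, target=None):
        # Claim only our own modules; everything else goes to the standard finders.
        if fullname not in self._modules:
            return None
        file_path, is_pkg = self._modules[fullname]
        return importlib.util.spec_from_loader(
            fullname, self, origin=file_path, is_package=is_pkg)

    def exec_module(self, module):
        # Read the encoded file, decode it, and execute it in the module namespace.
        file_path, _is_pkg = self._modules[module.__name__]
        with open(file_path, "rb") as f:
            source = base64.b64decode(f.read())
        exec(compile(source, file_path, "exec"), module.__dict__)

# Demonstration: build a tiny encoded package and import from it.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "demo_b64_pkg")
    os.makedirs(root)
    files = {"__init__.b64": b"", "core.b64": b"def answer():\n    return 6 * 7\n"}
    for name, src in files.items():
        with open(os.path.join(root, name), "wb") as f:
            f.write(base64.b64encode(src))
    importer = Base64Importer(root)
    sys.meta_path.append(importer)
    try:
        from demo_b64_pkg.core import answer
        result = answer()
    finally:
        sys.meta_path.remove(importer)
```

Appending the importer to the end of `sys.meta_path` means the standard finders get the first try. In this sketch the encoded package is not on `sys.path`, so they decline and the request reaches `Base64Importer`.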
How does it work? First, when an instance of the class is created, it collects information about the modules of our library that we "encrypted". Then, when a specific module is loaded, the corresponding "encrypted" file is read, "decrypted", and imported from the resulting "decrypted" text string using the facilities of the imp module. How do you use this class? Very easily. A single line enables importing the "encrypted" source code of our library, which in effect installs a hook on import:
sys.meta_path.append(Base64Importer(root_pkg_path))
where root_pkg_path is the absolute or relative path to the root package of our library. Conversely, if this line is removed, the ordinary source files are used when they are available. Everything happens completely transparently, and all the changes are made in one place.
That is all. From this point on, modules of our library are imported through interception and "decryption". Our hook is triggered by every import statement: if a module of our library is being imported, the hook processes and loads it, and all other imports are handled in the standard way. This is exactly what we needed for more advanced obfuscation. The importer code shown above and the hook installation can be placed in a *.pyd file, in the hope that nobody will disassemble it to find out what we are doing there. In a real project you can use real encryption, including encryption tied to a hardware key, which should make this method more robust. Changing the file names and the package structure can also help obscure things further.
Conclusion
In conclusion, I should say that I am against hiding source code that cannot really be hidden in the first place. I will not venture to discuss the ethical side of the issue here, or the usefulness or uselessness of hiding Python sources. I have simply presented one way to do it and get some kind of result. Naturally, this is not a panacea. Python code really cannot be hidden completely and from everyone. Once modules are loaded, their code can always be obtained through the language's built-in introspection capabilities, for example via the sys.modules variable. But that is far less obvious than having the source code open from the start.
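The limitation just described is easy to demonstrate. In this sketch, a module is built from a source string (a hypothetical stand-in for a module decrypted by an import hook) and registered in `sys.modules`. Its contents can then be inspected with nothing but the standard library. The names `hidden_mod` and `secret` are assumptions for illustration.

```python
import dis
import sys
import types

# Hypothetical stand-in for a module that was loaded from an encrypted file.
hidden_source = "def secret(x):\n    return x * 1234.5 + 0.25\n"
module = types.ModuleType("hidden_mod")
exec(compile(hidden_source, "<decrypted>", "exec"), module.__dict__)
sys.modules["hidden_mod"] = module

# Anyone who can run code in the same interpreter can now look inside.
loaded = sys.modules["hidden_mod"]
public_names = [n for n in vars(loaded) if not n.startswith("__")]
constants = loaded.secret.__code__.co_consts

# The function's bytecode is available to the standard disassembler.
dis.dis(loaded.secret)
```

The original text is not recovered this way, but the function names, constants and bytecode are, and a decompiler can work from there.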
It is possible that everything written here is not worth a penny: either long-known truths or the ravings of a madman. But if it proves useful to someone, I will be glad. For me personally the experience was valuable, if only because it helped me better understand Python's powerful and flexible module loading and import system and the Importer Protocol, things developers often do not need to know to write Python programs. It is always interesting to learn something new.
Thanks for your attention.
UPD 08/14/2013:
At tangro's request, I made a minimal project that demonstrates the described method, but without real encryption (only a few reversible transformation algorithms are applied).
Download the zip archive via the link. It requires Python 2.7 x86 on Windows. Run the script "test_main.py".
UPD 2:
And here is a more interesting example that performs some calculations. In it, all imports and function calls from the encrypted modules are hidden in a binary module. Download the archive via the link.