
Creating zip modules in python

Background


At a certain stage in the development of Acronis technology, we decided to explore the possibility of shipping our own build of the python3 interpreter as part of our products, extended with our own modules that provide access to the infrastructure of these products. This post is one of the results of that research.

First of all, we wanted a small, compact set of distributable files. The public Python build distributed via python.org does not offer this: the standard library alone, which is an integral part of the language, consists of more than a thousand py-files. That is why we immediately took note of an interesting feature of the interpreter: the ability to import modules stored in zip archives, where the entire set of Python sources belonging to one or several modules is packed into a single zip file and distributed as such.

Looking back, it is safe to say that support for zip modules in Python is a powerful and convenient thing. It works, and it works well. After a series of experiments with zip modules we got such a taste for zipping that we ended up packing the entire standard Python library (its script part) into a separate zip file as well.

Start


To begin with, we will create a test environment that is as simple as possible but still sufficient to demonstrate all the features discussed here. The environment will be Windows, simply because that was more convenient for me at the moment. For those who want to try these examples on Linux, I will just note that there should be no fundamental differences; the only requirement is an installed python3, either from your distribution's package manager or built the good old way with configure / make / make install.

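The simple demo modules that we will pack into a zip initially lived in d:\habr\lib, which contains the file say_hello.py and the directory my_sysinfo with the files __init__.py and sysinfo.py.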

Since, among other things, I wanted to demonstrate packaging several modules into one zip file, I created two different kinds of modules. The first, say_hello, consists of a single file say_hello.py with the function say_hello() in it. The second, my_sysinfo, is slightly more complicated: it is a directory with an __init__.py file that imports the function print_sysinfo. Looking ahead, I will say right away that this function, in addition to summary information such as sys.version, also prints its own call stack, specifically to reveal the peculiarities of living inside a zip.
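The sources of the two demo modules are not reproduced in the text, so the following is only a guess at their shape, reconstructed from the output shown in this post. The helper `write_demo_modules` and the exact function bodies are assumptions; only the file names, `say_hello()`, `print_sysinfo()` and the greeting text come from the article.

```python
import os
import tempfile
import textwrap

# Hypothetical reconstruction of the demo sources; the real files are not shown in the post.
DEMO_SOURCES = {
    "say_hello.py": '''
        def say_hello():
            print("Hello python world.")
    ''',
    os.path.join("my_sysinfo", "__init__.py"): '''
        from .sysinfo import print_sysinfo
    ''',
    os.path.join("my_sysinfo", "sysinfo.py"): '''
        import sys
        import traceback

        def print_sysinfo():
            print("-" * 80)
            print(sys.version)
            print("-" * 80)
            traceback.print_stack()
            print("-" * 80)
    ''',
}


def write_demo_modules(lib_dir):
    """Write the demo modules under lib_dir and return the created paths."""
    created = []
    for rel_path, source in DEMO_SOURCES.items():
        path = os.path.join(lib_dir, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf8") as f:
            f.write(textwrap.dedent(source).lstrip())
        created.append(path)
    return created


# Write the layout into a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as lib_dir:
    for path in write_demo_modules(lib_dir):
        print(os.path.relpath(path, lib_dir))
```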

We check that everything works unpacked:
    c:\Python33\python.exe

    Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.path.insert(0,'d:\\habr\\lib')
    >>> import say_hello
    >>> say_hello.say_hello()
    Hello python world.
    >>> import my_sysinfo
    >>> my_sysinfo.print_sysinfo()
    --------------------------------------------------------------------------------
    3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]
    --------------------------------------------------------------------------------
      File "<stdin>", line 1, in <module>
      File "d:\habr\lib\my_sysinfo\sysinfo.py", line 9, in print_sysinfo
        traceback.print_stack()
    --------------------------------------------------------------------------------


Zip packaging


There are no secrets in packing the original py-files into a zip. You can use any zip archiver you have at hand, or you can do the packing directly from a Python script using the standard zipfile module. A little later I will provide the code of a simple packaging script, which I called mkpyzip.py and put in the d:\habr\tools folder.

We package the above modules into the zip file d:\habr\output\mybundle.zip with this script:
    c:\Python33\python.exe d:\habr\tools\mkpyzip.py --src d:\habr\lib\my_sysinfo d:\habr\lib\say_hello.py --out d:\habr\output\mybundle.zip

    ::: d:\habr\lib\my_sysinfo\__init__.py >>> mybundle.zip/my_sysinfo/__init__.py
    ::: d:\habr\lib\my_sysinfo\sysinfo.py >>> mybundle.zip/my_sysinfo/sysinfo.py
    ::: d:\habr\lib\say_hello.py >>> mybundle.zip/say_hello.py
Among other things, the script prints a detailed log of which file is packed into the zip archive and under what name.

Let's check that everything still works when packaged in such a zip archive:
    c:\Python33\python.exe

    Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.path.insert(0, 'd:\\habr\\output\\mybundle.zip')
    >>> import say_hello
    >>> say_hello.say_hello()
    Hello python world.
    >>> import my_sysinfo
    >>> my_sysinfo.print_sysinfo()
    --------------------------------------------------------------------------------
    3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]
    --------------------------------------------------------------------------------
      File "<stdin>", line 1, in <module>
      File "d:\habr\output\mybundle.zip\my_sysinfo\sysinfo.py", line 9, in print_sysinfo
        traceback.print_stack()
    --------------------------------------------------------------------------------
The output shows that everything works as expected when packaged in a zip archive. In particular, the stack listing from our function my_sysinfo.print_sysinfo() shows that the code of the called function is located inside our zip file: d:\habr\output\mybundle.zip\my_sysinfo\sysinfo.py
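The same round trip can be sketched in a few lines with the standard zipfile module. This is a minimal illustration rather than the mkpyzip.py script: the helpers `build_demo_zip` and `import_from_zip` and the module name `zip_hello` are invented for the example.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(work_dir):
    """Create a one-module archive; the module name and body are illustrative only."""
    src_path = os.path.join(work_dir, "zip_hello.py")
    with open(src_path, "w", encoding="utf8") as f:
        f.write("def say_hello():\n    return 'Hello python world.'\n")
    zip_path = os.path.join(work_dir, "mybundle.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as fzip:
        # The archive name (second argument) is the path the importer looks up.
        fzip.write(src_path, "zip_hello.py")
    os.remove(src_path)  # the import below cannot come from the loose file
    return zip_path


def import_from_zip(zip_path, module_name):
    """Import module_name from the archive at zip_path via the regular import system."""
    sys.modules.pop(module_name, None)  # drop any cached copy so the archive is consulted
    sys.path.insert(0, zip_path)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(zip_path)


with tempfile.TemporaryDirectory() as work_dir:
    module = import_from_zip(build_demo_zip(work_dir), "zip_hello")
    print(module.say_hello())
    print(module.__file__)  # a path that continues inside mybundle.zip
```

Putting the archive itself on sys.path is all the interpreter needs; no unpacking step is involved.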

Bytecode generation when packing in zip


Now is the time to recall a well-known feature of the interpreter: when a module is imported, its bytecode is either generated or, if previously generated bytecode exists and is still valid at the time of import, loaded and executed. For modules packaged in a zip, things are a little different. For zip modules the bytecode must be generated and packed into the zip file in advance; otherwise, after every restart, the interpreter will regenerate the bytecode in memory for each module imported from the zip file. Fortunately, our mkpyzip.py script already provides for bytecode generation: just add the --mkpyc option and rebuild the zip file:
    c:\Python33\python.exe d:\habr\tools\mkpyzip.py --mkpyc --src d:\habr\lib\my_sysinfo d:\habr\lib\say_hello.py --out d:\habr\output\mybundle.zip

    ::: d:\habr\lib\my_sysinfo\__init__.py >>> mybundle.zip/my_sysinfo/__init__.py
    ::: mkpyc for: d:\habr\lib\my_sysinfo\__init__.py >>> mybundle.zip/my_sysinfo/__init__.pyc
    ::: d:\habr\lib\my_sysinfo\sysinfo.py >>> mybundle.zip/my_sysinfo/sysinfo.py
    ::: mkpyc for: d:\habr\lib\my_sysinfo\sysinfo.py >>> mybundle.zip/my_sysinfo/sysinfo.pyc
    ::: d:\habr\lib\say_hello.py >>> mybundle.zip/say_hello.py
    ::: mkpyc for: d:\habr\lib\say_hello.py >>> mybundle.zip/say_hello.pyc
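On current Python 3 versions a similar archive can probably be produced without writing the bytecode header by hand, by letting the standard py_compile module generate the .pyc and storing it next to the source inside the zip. The sketch below is an alternative to the approach in this post, not a copy of it: `add_with_pyc` and the module name `pyc_hello` are made up, and unlike mkpyzip.py it does not pass optimize=1.

```python
import os
import py_compile
import tempfile
import zipfile


def add_with_pyc(fzip, src_path, arcname):
    """Store src_path as arcname plus a sibling .pyc compiled by py_compile."""
    fzip.write(src_path, arcname)
    with tempfile.TemporaryDirectory() as tmp:
        pyc_path = os.path.join(tmp, "module.pyc")
        # dfile is the file name recorded in the bytecode and shown in tracebacks.
        py_compile.compile(src_path, cfile=pyc_path, dfile=arcname, doraise=True)
        # The zip importer expects name.pyc right beside name.py, not in __pycache__.
        pyc_arcname = os.path.splitext(arcname)[0] + ".pyc"
        fzip.write(pyc_path, pyc_arcname)
    return pyc_arcname


with tempfile.TemporaryDirectory() as work_dir:
    src_path = os.path.join(work_dir, "pyc_hello.py")
    with open(src_path, "w", encoding="utf8") as f:
        f.write("def say_hello():\n    return 'Hello python world.'\n")
    zip_path = os.path.join(work_dir, "mybundle.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as fzip:
        add_with_pyc(fzip, src_path, "pyc_hello.py")
    with zipfile.ZipFile(zip_path) as fzip:
        print(fzip.namelist())
```

Passing the archive-relative name as dfile mirrors what mkpyzip.py does with its codename argument, so the resulting bytecode carries the same relative path.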

Now that the basic aspects of packaging Python modules into a zip file have been covered, it is time to show the code of the mkpyzip.py utility itself. I will note right away that there is nothing special in this script, and that the prototype for the bytecode generation was borrowed from the Python standard library (to find that prototype, search for the keyword wr_long).
mkpyzip.py:

    import argparse
    import imp  # deprecated since Python 3.4 and removed in 3.12; fine for the 3.3 used here
    import io
    import marshal
    import os
    import os.path
    import zipfile


    def compile_file(filename, codename, out):
        """Compile filename and write a Python 3.3 style .pyc image to out."""
        def wr_long(f, x):
            # Write x as a 32-bit little-endian integer.
            f.write(bytes([x & 0xff, (x >> 8) & 0xff, (x >> 16) & 0xff, (x >> 24) & 0xff]))

        with io.open(filename, mode='rt', encoding='utf8') as f:
            source = f.read()
            # codename is stored in the bytecode as the file name shown in tracebacks.
            code = compile(source, codename, 'exec', optimize=1)
            st = os.fstat(f.fileno())
        timestamp = int(st.st_mtime)
        size = st.st_size & 0xFFFFFFFF
        # Header: magic number (placeholder, filled in last), source mtime, source size.
        out.write(b'\0\0\0\0')
        wr_long(out, timestamp)
        wr_long(out, size)
        marshal.dump(code, out)
        out.flush()
        out.seek(0, 0)
        out.write(imp.get_magic())


    def compile_in_memory(source, codename):
        """Return the .pyc image for the source file as a bytes object."""
        with io.BytesIO() as fc:
            compile_file(source, codename, fc)
            return fc.getvalue()


    def make_module_catalog(src):
        """Return (file path, archive name) pairs for a module file or a package directory."""
        root_path = os.path.abspath(os.path.normpath(src))
        root_arcname = os.path.basename(root_path)
        if not os.path.isdir(root_path):
            return [(root_path, root_arcname)]
        catalog = []
        subdirs = [(root_path, root_arcname)]
        while subdirs:
            subdir_path, subdir_arcname = subdirs.pop()
            for item in sorted(os.listdir(subdir_path)):
                # Existing caches and bytecode files are never copied into the archive.
                if item == '__pycache__' or item.endswith('.pyc'):
                    continue
                item_path = os.path.join(subdir_path, item)
                item_arcname = '/'.join([subdir_arcname, item])
                if os.path.isdir(item_path):
                    subdirs.append((item_path, item_arcname))
                else:
                    catalog.append((item_path, item_arcname))
        return catalog


    def mk_pyzip(sources, outzip, mkpyc=False):
        """Pack the given sources into outzip, optionally adding a .pyc beside each .py."""
        zipfilename = os.path.abspath(os.path.normpath(outzip))
        display_zipname = os.path.basename(zipfilename)
        with zipfile.ZipFile(zipfilename, "w", zipfile.ZIP_DEFLATED) as fzip:
            for src in sources:
                for fname, arcname in make_module_catalog(src):
                    fzip.write(fname, arcname)
                    print("::: {} >>> {}/{}".format(fname, display_zipname, arcname))
                    if mkpyc and arcname.endswith('.py'):
                        pyc_data = compile_in_memory(fname, arcname)
                        pyc_name = os.path.splitext(arcname)[0] + '.pyc'
                        fzip.writestr(pyc_name, pyc_data)
                        print("::: mkpyc for: {} >>> {}/{}".format(fname, display_zipname, pyc_name))


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('--src', nargs='+', required=True)
        parser.add_argument('--out', required=True)
        parser.add_argument('--mkpyc', action='store_true')
        args = parser.parse_args()
        mk_pyzip(args.src, args.out, args.mkpyc)


    if __name__ == '__main__':
        main()


Bytecode validity


I will also add a few words about how to make sure that the bytecode we generated is valid and that the interpreter picks it up on import without trying to regenerate bytecode in memory.
To do this, simply print the __file__ attribute of the imported module say_hello:
    c:\Python33\python.exe

    Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.path.insert(0,'d:\\habr\\output\\mybundle.zip')
    >>> import say_hello
    >>> say_hello.__file__
    'd:\\habr\\output\\mybundle.zip\\say_hello.pyc'
The fact that the __file__ attribute of the loaded module points to the pyc file generated by us is sufficient proof of the validity of our bytecode.
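A complementary check is to look at the header of the packed .pyc directly. The sketch below assumes Python 3.7 or newer, where the header is 16 bytes (magic number, flags, source mtime, source size); on the Python 3.3 used in this post the header is 12 bytes without the flags field, which is what compile_file in mkpyzip.py writes. `read_pyc_header` is a hypothetical helper, and the mtime and size fields are only meaningful for timestamp-based files (flags equal to 0).

```python
import importlib.util
import struct
import zipfile


def read_pyc_header(zip_path, pyc_arcname):
    """Return (magic_ok, flags, mtime, size) for a .pyc stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as fzip:
        header = fzip.read(pyc_arcname)[:16]
    if len(header) < 16:
        raise ValueError("truncated .pyc header")
    magic = header[:4]
    # Three little-endian unsigned 32-bit integers follow the magic number.
    flags, mtime, size = struct.unpack("<III", header[4:16])
    # A .pyc is only usable by an interpreter with the same magic number.
    return magic == importlib.util.MAGIC_NUMBER, flags, mtime, size
```

For the archive built in this post, the call would look like `read_pyc_header(r"d:\habr\output\mybundle.zip", "say_hello.pyc")`. A mismatching magic number means the bytecode was produced by a different interpreter version and will not be used.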

With this I could probably finish my introductory overview of zip packaging in Python with a clear conscience, if not for one “but”...

Surprises


One of my colleagues once took Eclipse with the well-known PyDev add-on and tried to debug a Python script of his which, among other things, used functionality from Python modules zipped with the technique just described.

The main unpleasant surprise was that PyDev refused to debug such modules at all. Intrigued by this, we began to look for the source of the problem. Looking back now, our personal conviction is that PyDev simply lacks adequate support for debugging zip modules.

However, at the time of the investigation the nuances of debugging in PyDev were quickly set aside, since pdb, the debugger built into Python, also reported rather dubious call stack information. Moreover, the information was dubious only when the zip archive contained pyc-files with bytecode alongside the source py-files. With a zip archive containing only py-files, the automatically generated bytecode was clearly different, and debugging in pdb gave correct information that raised no objections. Apart from debugging, everything worked as expected. Nevertheless, there was definitely something wrong with our bytecode, and pdb was clearly signaling this to us.

Now that we have found the source of the problem, there is no need to go into the details of debugging Python code under pdb. To clarify the cause, let's simply print the call stack again, this time from the zipped bytecode, using the previously written function print_sysinfo() from the module my_sysinfo.
    c:\Python33\python.exe

    Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.path.insert(0,'d:\\habr\\output\\mybundle.zip')
    >>> import my_sysinfo
    >>> my_sysinfo.print_sysinfo()
    --------------------------------------------------------------------------------
    3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]
    --------------------------------------------------------------------------------
      File "<stdin>", line 1, in <module>
      File "my_sysinfo/sysinfo.py", line 9, in print_sysinfo
        traceback.print_stack()
    --------------------------------------------------------------------------------

Now let's compare this output with the one we got earlier, before we started zipping our own bytecode. The key difference is in the file paths in the stack frames.

Without bytecode in the zip file, the output had the form:
      File "<stdin>", line 1, in <module>
      File "d:\habr\output\mybundle.zip\my_sysinfo\sysinfo.py", line 9, in print_sysinfo
        traceback.print_stack()

After adding the bytecode, it took the form:
      File "<stdin>", line 1, in <module>
      File "my_sysinfo/sysinfo.py", line 9, in print_sysinfo
        traceback.print_stack()

The output makes it clear that once bytecode is added to the zip archive, the file path in the call stack turns from absolute into relative, namely relative to the root of the zip archive. An attentive reader may immediately object that we generated such bytecode ourselves by passing this relative path to the builtin compile function in mkpyzip.py. But on reflection it becomes clear that a full path is not appropriate here, because our ultimate goal is to build the zip archive on one machine and use it on another, possibly even on a machine with a different operating system.
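The link between the second argument of compile and the paths printed in a call stack is easy to reproduce in isolation. In this small sketch the source text and the helper `filename_seen_by_traceback` are invented for the demonstration; only the two path strings are taken from the post.

```python
SOURCE = '''
import traceback

def where_am_i():
    # The last stack entry describes the frame that called extract_stack().
    return traceback.extract_stack()[-1].filename
'''


def filename_seen_by_traceback(codename):
    """Compile SOURCE under codename and report the file name its frames carry."""
    code = compile(SOURCE, codename, "exec")
    namespace = {}
    exec(code, namespace)
    return namespace["where_am_i"]()


# Whatever string is given to compile() is what tracebacks and debuggers will see.
print(filename_seen_by_traceback("my_sysinfo/sysinfo.py"))
print(filename_seen_by_traceback(r"d:\habr\output\mybundle.zip\my_sysinfo\sysinfo.py"))
```

The name is stored verbatim in the code object, so unless the loader rewrites it at import time, bytecode compiled with an archive-relative name keeps reporting that relative name.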

None of us at that time was intimately familiar with how the interpreter loads zip modules, so we could not give an unambiguous answer as to the root of the problem: either we were unknowingly missing something when generating the bytecode, or the zip module loader in Python itself was misbehaving when loading it.

In the end we decided to seek advice from the Python developers via python-dev@python.org. The only thing they advised at that time was to file a bug on the topic so that the context of the problem would not be lost. We filed the bug, bugs.python.org/issue18307, and waited. After about a month of waiting and working on other equally pressing problems, our patience quietly ran out, and python33.dll went into the debugger.

As a result, we confirmed our suspicions and can say with certainty that it is the C implementation of the zip module loader in Python that misbehaves when loading bytecode. More precisely, the case described here, which requires automatic normalization of paths in bytecode loaded from zip files, was simply not implemented. So, within the same bug, we proposed a patch that fixes the problem and makes the file paths in call stacks absolute.
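The patch itself lives in the C code of the interpreter and is not reproduced here. Purely as an illustration of what normalizing paths in bytecode means, the sketch below rewrites co_filename on a code object and on every code object nested in it. It relies on code.replace(), available from Python 3.8; the function name `with_absolute_filename` is an assumption of this example and not part of the patch.

```python
import types


def with_absolute_filename(code, new_filename):
    """Return a copy of code whose co_filename, including nested code, is new_filename."""
    consts = tuple(
        with_absolute_filename(const, new_filename)
        if isinstance(const, types.CodeType) else const
        for const in code.co_consts
    )
    return code.replace(co_filename=new_filename, co_consts=consts)


# Bytecode compiled with an archive-relative name, as mkpyzip.py does.
relative = compile("def f():\n    pass\n", "my_sysinfo/sysinfo.py", "exec")
fixed = with_absolute_filename(
    relative, r"d:\habr\output\mybundle.zip\my_sysinfo\sysinfo.py")
print(relative.co_filename)
print(fixed.co_filename)
```

Functions and classes are stored as nested code objects among the constants of the module code, which is why the rewrite has to recurse; changing only the top-level object would leave every function frame with the relative path.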

Now, about half a year later, this bug on bugs.python.org remains open. Apparently this is because zip modules, although a powerful feature of Python, are rarely used, and the case with bytecode inside the zip archive even more so. However, since we have our own repository with the Python sources (which we try to keep as close as possible to the public original), we simply committed this patch there.

Conclusion


Python modules packed into a zip archive work just as well as unpacked ones. The only thing to be ready for is that after packaging you may run into certain difficulties debugging them, both in Eclipse + PyDev and in other IDEs whose debugging is also based on PyDev. However, in certain situations the ability to ship a compact set of binary production modules may matter much more than easy debugging of Python code in an IDE.

P.S. Did we reinvent setuptools / eggs? No.


Zip modules in Python are a completely independent and self-sufficient feature built into the core of the interpreter itself. setuptools / eggs is merely the best-known use of it.

Source: https://habr.com/ru/post/208378/

