
Python threading, or the GIL is almost no hindrance

Probably everyone who has ever taken an interest in Python knows about the GIL — at once its strength and its weakness.
While it does not get in the way of single-threaded scripts, it throws plenty of obstacles in the path of multi-threaded work on CPU-bound tasks (when threads are actually computing, rather than taking turns waiting on I/O and the like).
The details are well described in a translation from two years ago. We cannot overcome the GIL in the official Python build to get true thread parallelism, but we can go another way: prevent the OS from migrating Python threads between cores. In short, this is a post in the "you don't need it, but you really want it" series :)
If you already know about processor/CPU affinity and have used ctypes and pywin32, there will be nothing new here.


How it all began


Take some simple code (almost the same as in the translated article):

    from threading import Thread
    import timeit

    cnt = 100000000
    trying = 2

    def count():
        n = cnt
        while n > 0:
            n -= 1

    def test1():
        count()
        count()

    def test2():
        t1 = Thread(target=count)
        t1.start()
        t2 = Thread(target=count)
        t2.start()
        t1.join(); t2.join()

    seq1 = timeit.timeit('test1()', 'from __main__ import test1', number=trying) / trying
    print seq1
    par1 = timeit.timeit('test2()', 'from __main__ import test2', number=trying) / trying
    print par1


Run on Python 2.6.5 (Ubuntu 10.04 x64, i5 750):
 10.41
 13.25 

And on Python 2.7.2 (Win7 x64, i5 750):
 19.25
 27.41 

Let's set aside the fact that the Windows build is noticeably slower overall. In both cases the multi-threaded version shows a significant slowdown.

If you really want, you can


The GIL will in any case not let the multithreaded version run faster than the single-threaded one. But if introducing threads simplifies the implementation of some functionality, it is at least worth trying to reduce this lag where possible.
While a multi-threaded application is running, the OS is free to migrate its threads between cores. And when two (or more) threads of the same Python process simultaneously try to grab the GIL, things slow down. Migration happens for single-threaded programs too, but there it does not affect speed.

Accordingly, to make the threads take the GIL in turn, we can restrict the Python process to a single core. The CPU affinity mask helps us here: it specifies, as bit flags, which cores/processors the program is allowed to run on.
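As an aside (and not part of the original article): on Python 3.3+ under Linux, the standard library exposes this mask directly via os.sched_getaffinity/os.sched_setaffinity, so pinning can be sketched without any ctypes:

```python
import os

# os.sched_getaffinity/os.sched_setaffinity exist on Linux with Python 3.3+;
# guarded with hasattr so the sketch stays portable.
if hasattr(os, 'sched_getaffinity'):
    allowed = os.sched_getaffinity(0)   # set of core indices for the current process
    one_core = {min(allowed)}
    os.sched_setaffinity(0, one_core)   # pin the process to a single core
    pinned = os.sched_getaffinity(0)    # now reports just that one core
    os.sched_setaffinity(0, allowed)    # restore the original mask
```

Note that this API takes a set of core indices rather than a raw bitmask; the article's approach below works on the bitmask level and on older interpreters.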

On different operating systems this is done by different means, but for now let's consider Ubuntu Linux and Windows XP+. FreeBSD 8.2 on an Intel Xeon was also examined, but that remains outside the article.

And how many cores do we have?


Before choosing a core, we need to find out how many of them are at our disposal. The starting point depends on the platform: multiprocessing.cpu_count() in Python 2.6+, os.sysconf('SC_NPROCESSORS_ONLN') on POSIX, and so on. An example of such detection can be found here.
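The detection code itself is not given in the article, but a minimal sketch combining the fallbacks just mentioned might look like this (cpu_count is my own helper name, not from the article's code):

```python
import os

def cpu_count():
    # multiprocessing.cpu_count() is available from Python 2.6 on
    try:
        import multiprocessing
        return multiprocessing.cpu_count()
    except (ImportError, NotImplementedError):
        pass
    # POSIX fallback: number of processors currently online
    try:
        n = os.sysconf('SC_NPROCESSORS_ONLN')
        if isinstance(n, int) and n > 0:
            return n
    except (AttributeError, ValueError, OSError):
        pass
    return 1  # conservative default when nothing can be detected
```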

The tools chosen for working with processor affinity directly are described below.


Linux ubuntu


To reach libc we will use the ctypes module. The required library is loaded with ctypes.CDLL:

    libc = ctypes.CDLL('libc.so.6')
    libc.sched_setaffinity  # the function we need

Everything would be fine, but there are two points:

Our functions:
    int sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
    int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);

pid_t is an int; cpu_set_t is a structure with a single field 1024 bits in size (i.e., up to 1024 cores/processors can be addressed).
We use cpusetsize so that we don't have to deal with all those cores at once, and we assume cpu_set_t is an unsigned long. Strictly speaking, ctypes arrays should be used here, but that is beyond the scope of this article.
Note also that the mask is passed as a pointer, i.e. ctypes.POINTER(<type of the value itself>).
After mapping the C types to ctypes types we get:

    __setaffinity = libc.sched_setaffinity
    __setaffinity.argtypes = [ctypes.c_int, ctypes.c_size_t,
                              ctypes.POINTER(ctypes.c_ulong)]
    __getaffinity = libc.sched_getaffinity
    __getaffinity.argtypes = [ctypes.c_int, ctypes.c_size_t,
                              ctypes.POINTER(ctypes.c_ulong)]


Once argtypes is specified, ctypes checks the types of the values passed in. So that the module does not complain and does its job, we pass correctly typed values in the calls:

    def get_affinity(pid=0):
        mask = ctypes.c_ulong(0)                      # the mask to be filled in
        c_ulong_size = ctypes.sizeof(ctypes.c_ulong)  # 32/64 bits depending on the platform
        if __getaffinity(pid, c_ulong_size, mask) < 0:
            raise OSError
        return mask.value  # ctypes.c_ulong => python int

    def set_affinity(pid=0, mask=1):
        mask = ctypes.c_ulong(mask)
        c_ulong_size = ctypes.sizeof(ctypes.c_ulong)
        if __setaffinity(pid, c_ulong_size, mask) < 0:
            raise OSError


As you can see, ctypes handled the pointer conversion implicitly. Note also that calling with pid=0 applies to the current process.
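For completeness, the "proper" cpu_set_t mentioned above could be modeled with a ctypes array. This is only a sketch of the structure; the names and helper methods are mine, not from the article or the libc headers:

```python
import ctypes

BITS_PER_WORD = ctypes.sizeof(ctypes.c_ulong) * 8
CPU_SETSIZE = 1024  # cpu_set_t holds 1024 bits

class cpu_set_t(ctypes.Structure):
    # a single array field covering all 1024 bits
    _fields_ = [('bits', ctypes.c_ulong * (CPU_SETSIZE // BITS_PER_WORD))]

    def set_cpu(self, cpu):
        # mark the given core as allowed
        self.bits[cpu // BITS_PER_WORD] |= 1 << (cpu % BITS_PER_WORD)

    def is_set(self, cpu):
        # check whether the given core is allowed
        return bool((self.bits[cpu // BITS_PER_WORD] >> (cpu % BITS_PER_WORD)) & 1)

mask = cpu_set_t()
mask.set_cpu(0)
mask.set_cpu(3)
```

An instance of this structure could then be passed (via ctypes.byref) as the mask argument, with ctypes.sizeof(cpu_set_t) as cpusetsize.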

Windows XP+


The documentation for the functions we need states:
 Minimum supported client - Windows XP
 Minimum supported server - Windows Server 2003
 DLL - Kernel32.dll

Now we know when this will work and which library needs to be loaded.

We proceed by analogy with the Linux version. Take the headers:
    BOOL WINAPI SetProcessAffinityMask(
        __in  HANDLE hProcess,
        __in  DWORD_PTR dwProcessAffinityMask
    );

    BOOL WINAPI GetProcessAffinityMask(
        __in  HANDLE hProcess,
        __out PDWORD_PTR lpProcessAffinityMask,
        __out PDWORD_PTR lpSystemAffinityMask
    );

ctypes.c_uint will serve perfectly well as a HANDLE, but you need to be careful with the parameter types:
DWORD_PTR is again ctypes.c_uint, while PDWORD_PTR is already ctypes.POINTER(ctypes.c_uint).
Altogether we get:

    __setaffinity = ctypes.windll.kernel32.SetProcessAffinityMask
    __setaffinity.argtypes = [ctypes.c_uint, ctypes.c_uint]
    __getaffinity = ctypes.windll.kernel32.GetProcessAffinityMask
    __getaffinity.argtypes = [ctypes.c_uint, ctypes.POINTER(ctypes.c_uint),
                              ctypes.POINTER(ctypes.c_uint)]


It seems we can just write the following and it will work:

    def get_affinity(pid=0):
        mask_proc = ctypes.c_uint(0)
        mask_sys = ctypes.c_uint(0)
        if not __getaffinity(pid, mask_proc, mask_sys):
            raise ValueError
        return mask_proc.value

    def set_affinity(pid=0, mask=1):
        mask_proc = ctypes.c_uint(mask)
        if not __setaffinity(pid, mask_proc):
            raise OSError

But alas: the functions accept not a pid but a process HANDLE, which we still have to obtain. For that we use the OpenProcess function and its counterpart CloseHandle:

    PROCESS_SET_INFORMATION = 512
    PROCESS_QUERY_INFORMATION = 1024

    __close_handle = ctypes.windll.kernel32.CloseHandle

    def __open_process(pid, ro=True):
        if not pid:
            pid = os.getpid()
        access = PROCESS_QUERY_INFORMATION
        if not ro:
            access |= PROCESS_SET_INFORMATION
        hProc = ctypes.windll.kernel32.OpenProcess(access, 0, pid)
        if not hProc:
            raise OSError
        return hProc

Without going into details: we simply obtain a HANDLE to the process we need with the right to read its parameters, and with ro=False also to change them. This is spelled out in the documentation for SetProcessAffinityMask and GetProcessAffinityMask:
 SetProcessAffinityMask:
 hProcess [in]
 A handle to the process whose affinity mask is to be set.
 This handle must have the PROCESS_SET_INFORMATION access right.

 GetProcessAffinityMask:
 hProcess [in]
 A handle to the process whose affinity mask is desired.
 Windows Server 2003 and Windows XP: The handle must have the PROCESS_QUERY_INFORMATION access right.

So no Monte Carlo method :)

We rewrite our get_affinity and set_affinity with the changes:
    def get_affinity(pid=0):
        hProc = __open_process(pid)
        mask_proc = ctypes.c_uint(0)
        mask_sys = ctypes.c_uint(0)
        if not __getaffinity(hProc, mask_proc, mask_sys):
            raise ValueError
        __close_handle(hProc)
        return mask_proc.value

    def set_affinity(pid=0, mask=1):
        hProc = __open_process(pid, ro=False)
        mask_proc = ctypes.c_uint(mask)
        res = __setaffinity(hProc, mask_proc)
        __close_handle(hProc)
        if not res:
            raise OSError


Windows XP+ for the lazy


To trim the Windows implementation a little, you can install the pywin32 module. It saves us from defining the constants and dealing with library loading and call parameters. Our code above might look something like this:

    import win32process, win32con, win32api, win32security
    import os

    def __open_process(pid, ro=True):
        if not pid:
            pid = os.getpid()
        access = win32con.PROCESS_QUERY_INFORMATION
        if not ro:
            access |= win32con.PROCESS_SET_INFORMATION
        hProc = win32api.OpenProcess(access, 0, pid)
        if not hProc:
            raise OSError
        return hProc

    def get_affinity(pid=0):
        hProc = __open_process(pid)
        mask, mask_sys = win32process.GetProcessAffinityMask(hProc)
        win32api.CloseHandle(hProc)
        return mask

    def set_affinity(pid=0, mask=1):
        try:
            hProc = __open_process(pid, ro=False)
            mask_old, mask_sys_old = win32process.GetProcessAffinityMask(hProc)
            res = win32process.SetProcessAffinityMask(hProc, mask)
            win32api.CloseHandle(hProc)
            if res:
                raise OSError
        except win32process.error as e:
            raise ValueError, e
        return mask_old

Shorter, of course, but it is a third-party module.

And what is the result?


If we put all of this together and add one more test to our initial ones:

    def test3():
        cpuinfo.affinity.set_affinity(0, 1)  # bind the current process (pid=0) to the first core
        test2()

    par2 = timeit.timeit('test3()', 'from __main__ import test3', number=trying) / trying
    print par2


then the results will be as follows:
 Linux:
 test1: 10.41 |  102.89
 test2: 13.25 |  135.29
 test3: 10.45 |  104.51

 Windows:
 test1: 19.25 |  191.97
 test2: 27.41 |  269.78
 test3: 19.52 |  196.17

The numbers in the second column are the same tests, but with cnt 10 times larger.
We got two threads of execution with virtually no loss in speed compared to the single-threaded version.

Affinity is set as a bitmask on both OSes. On a quad-core machine, get_affinity returns 15 (1 + 2 + 4 + 8).
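The bitmask convention is easy to illustrate with a tiny helper (hypothetical, not part of the article's code):

```python
def mask_for_cores(cores):
    """Build an affinity bitmask allowing only the given core indices."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask

print(mask_for_cores([0]))           # core 0 only -> 1
print(mask_for_cores([0, 1, 2, 3]))  # all four cores of a quad-core -> 15
```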

The example and all the code for the article are published on github.
Suggestions and complaints are welcome.
I would also be interested in results on processors with HT support and on other Linux versions.

Happy April 1st! This code really does work :)

Source: https://habr.com/ru/post/141181/

