In search of an optimal synchronization mechanism, I often came across "mutex vs. ..." posts, but they did not give me much clarity. So I decided to write a small test comparing the overhead of different kinds of locking mechanisms. You could say the test measures the latency of the locking mechanisms. Its essence is that a certain number of threads compete for a resource. The resource -
volatile unsigned long int incremented;
“Useful” work - execute BIG_NUMBER increments.
The purpose of the test is to estimate the cost of synchronization, that is, to plot how the time spent depends on the number of competing threads for different synchronization mechanisms.
Bundles of threads (1..N of them) perform the same total number of increments, synchronizing in different ways.
Below are the structure of the test and the results.
PINC (Parallel INCrementor).
Sources and results are on GitHub. You can download and run them. I do not quote them here because they are bulky; I will limit myself to a brief description.
So, there is a global variable incremented. The volatile modifier is used so that, when compiling with -O3, the variable is not kept in a register.
For the sequential version, the test looks like this:
for(long i = 0;i < BIG_NUMBER; i++) incremented++;
In the case of parallel execution, BIG_NUMBER should be divided equally between all threads, so it is convenient to choose it as the factorial of the maximum number of threads: N! = 1 * 2 * ... * M * ... * (N-1) * N. Then, for any M <= N threads, each one clearly gets an equal number of increments.
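For illustration, a minimal sketch of how the per-thread share could be computed (the names factorial, num_threads and share are mine, not taken from the sources):

/* Sketch: with BIG_NUMBER = N!, the work divides evenly
 * for any thread count M <= N. Names are illustrative. */
unsigned long factorial(unsigned n)
{
    unsigned long f = 1;
    for (unsigned i = 2; i <= n; i++)
        f *= i;
    return f;
}
/* BIG_NUMBER = factorial(11);             11! = 39916800 = 0x2611500 */
/* share      = BIG_NUMBER / num_threads;  increments per thread      */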
Active test blocks (in parentheses are the keywords appearing in the test):
- Pure sequential increment, executed in a single thread only (serial increment):
incremented++;
- Atomic increment (lock increment test)
__sync_fetch_and_add(&incremented,1)
Here we expect this builtin to compile into something like "lock xaddl ...".
- CAS increment (lock compare & swap test)
do{ x = incremented; }while(!__sync_bool_compare_and_swap(&incremented,x , x + 1));
and this one into "lock cmpxchgl ...".
- Spinlock (spinlock test)
pthread_spin_lock(&spin)
- Mutex (mutexed test); a sketch of the per-thread loop is given after this list
pthread_mutex_lock(&mut)
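The full code is in the repository; as a rough sketch (with names I chose for illustration) of what the per-thread loop could look like in the mutex case:

#include <pthread.h>

volatile unsigned long int incremented;                 /* the shared resource */
static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread work for the mutex test: take the lock around
 * every single increment of this thread's share. */
static void *mutex_thread(void *arg)
{
    long share = *(long *)arg;   /* BIG_NUMBER / number_of_threads */
    for (long i = 0; i < share; i++) {
        pthread_mutex_lock(&mut);
        incremented++;
        pthread_mutex_unlock(&mut);
    }
    return NULL;
}

The spinlock variant looks the same with pthread_spin_lock/pthread_spin_unlock, while the atomic and CAS variants replace the whole locked section with the one-liners above.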
Test structure
Three functions are created for each active block: init, thread and fini - their names speak for themselves. Init is executed before the bundle of threads starts, fini - after it finishes. Thread performs its share of increments in each thread.
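A minimal sketch of how such a triple could be grouped (the struct and field names are my illustration, not the actual layout in the sources):

struct test_block {
    const char *name;             /* keyword that appears in the results      */
    void  (*init)(void);          /* runs before the bundle of threads starts */
    void *(*thread)(void *arg);   /* does one thread's share of increments    */
    void  (*fini)(void);          /* runs after the bundle has finished       */
};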
Each thread is passed a data structure containing the number of running threads, the value of the "big number", and its own thread number. The thread number is used to bind the thread to a given processor according to the formula:
cpu_number = thread_number % number_of_cores
Threads are bound to processors to ensure that they run on different cores. The binding is implemented using the pthread_setaffinity_np function.
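A hedged sketch of such a binding (bind_to_cpu is a helper I made up for illustration; the actual call site is in the sources):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to CPU (thread_number % online_cores).
 * Illustrative helper, not the exact code from the repository. */
static int bind_to_cpu(int thread_number, int online_cores)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(thread_number % online_cores, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}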
Each thread performs its share of increments and exits. The execution time is measured in milliseconds. Then the next bundle of threads, or the next test, is run.
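For reference, a millisecond timer could look like this (a sketch; the repository may time the runs differently):

#include <time.h>

/* Monotonic timestamp in milliseconds; take it before starting the
 * bundle and after joining the last thread, then subtract (illustrative). */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}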
Results
A desktop computer served as the test machine:
CPU model name: Intel® Core (TM) 2 Duo CPU E8400 @ 3.00GHz
Thread libs: NPTL 2.13
2 of 2 cores online
Additional threads: 100
Big number: 11! = 0x2611500
And this is what it produced (the pictures show the same data, but at different scales):
Each point here is the time taken to perform 11! increments.



I will not philosophize about the advantages of individual mechanisms, since each is aimed at its own tasks. However, from the pictures it is easy to estimate the overhead of each of them.
In principle, everything is more or less predictable. The only surprise is that a spinlock is sometimes even better than CAS.
It would be very interesting to see such graphs for machines with a large number of cores. Unfortunately, I cannot get data for a wide range of architectures myself. Therefore, I urge Habr readers to download the test from GitHub and commit the results back or post them here.
Good luck.
P.S. Using the test is simple (-h for help):
make;./pinc -f11 -t50
That is, BIG_NUMBER = 11! increments will be executed for each synchronization mechanism, in bundles of threads from 1 to (number_of_cores + 50).
If enough material accumulates, I will publish it here. I think everyone will find it interesting. Thanks.