This is the last article about the classic synchronization primitives.
(Though I will probably write one more about a completely atypical problem, but that's for later.)
Today we will also look at the processor a little. Just a little.
In fact, we will talk about a single primitive, which is fundamentally different from the rest: the spinlock.
In the comments to the previous notes there was a discussion about whether it is fair to single out the spinlock as a separate primitive — after all, isn't it essentially just a mutex? It performs the same function: it forbids simultaneous execution of a code fragment by several parallel threads.
At the application level that is indeed the case: the differences between a spinlock and a mutex are purely technical, a matter of implementation and performance.
But I am interested in this topic not only from the standpoint of an application programmer, but also from the standpoint of a kernel developer, and of a developer of the synchronization primitives themselves. And here the difference is fundamental.
The fact is that inside the kernel the mutex is implemented with the help of a spinlock, while the spinlock is implemented on its own, autonomously. It is truly a basic primitive. Below it there is only the processor itself.
There is one more semantic difference. The mutex allows and assumes that the calling thread may be removed from the processor and stopped for a long time. A mutex can lock an object for an hour or a day; this is acceptable and normal. A spinlock, by design, is intended only for the shortest of pauses; it always guards a brief manipulation of an object's non-atomic state. Assigning a group of variables, a small loop — that is the maximum of what can be done under a spinlock.
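To make that concrete, here is a rough sketch (my own illustration, not a quote from any particular kernel) of the kind of critical section a spinlock is meant for — a couple of assignments and an immediate release:

    /* Hypothetical names: spinlock_t, spin_lock(), spin_unlock() stand in
       for whatever the target kernel actually provides. */
    struct node { struct node *next; /* payload omitted */ };

    spinlock_t   q_lock;
    struct node *q_head;
    int          q_count;

    void queue_push(struct node *n)
    {
        spin_lock(&q_lock);     /* the wait here is expected to be very short */
        n->next = q_head;       /* just a few assignments...                  */
        q_head  = n;
        q_count++;
        spin_unlock(&q_lock);   /* ...and we let go immediately               */
    }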
So, the implementation hierarchy is as follows: mutex / cond / sema are built on top of spinlocks, and spinlocks are built on top of atomic operations provided by the processor. We will look at those a little today.
How does a spinlock work?
The principle is incredibly simple. A spinlock is just a variable that contains zero or one (there are variations).
If it is zero, the spinlock is free and can be acquired. If it is non-zero, the spinlock is locked, and a thread that wants to grab it will wait, spinning in a small loop that continuously tests whether the spinlock has been released. Here is the implementation:
while( ! _spin_try_lock( &(sl->lock) ) )
    while( sl->lock )
        ;
The _spin_try_lock operation is atomic and is implemented in the assembler of the corresponding processor. We try to lock the spinlock; if that fails, we spin in a loop while it is locked, then try again.
That's it, in general.
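For completeness, here is a sketch of what the whole primitive might look like around that loop — the structure and the unlock side are my reconstruction, not a quote from a specific kernel:

    struct spinlock { volatile int lock; };

    void spin_lock(struct spinlock *sl)
    {
        /* Try to grab the lock; while it is busy, spin on plain reads,
           which is cheaper than hammering the bus with atomic exchanges. */
        while( ! _spin_try_lock( &(sl->lock) ) )
            while( sl->lock )
                ;
    }

    void spin_unlock(struct spinlock *sl)
    {
        sl->lock = 0;   /* a single aligned store releases the lock */
    }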
Now the details. The _spin_try_lock operation is also very simple:
#define _spin_try_lock(p) \
    (!({ register int _r__; \
        __asm__ volatile("movl $1, %0; \n\
                          xchgl %0, %1" \
                          : "=&r" (_r__), "=m" (*(p)) ); \
        _r__; }))
The point is that the Intel xchgl instruction atomically swaps a register with a variable in memory. Atomically even with respect to other processors, that is, in an SMP environment.
What is happening here? We swap the value of sl->lock with a register that holds a one. If the spinlock was not locked (sl->lock was zero), it becomes one, the register receives the old zero, and _spin_try_lock (which negates the result) returns a non-zero value — we have successfully locked the spinlock. If sl->lock was already one, then after the exchange sl->lock is again one, the register receives the previous value — one — and _spin_try_lock returns zero, which means the attempt to lock the spinlock failed.
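By the way, on modern compilers the same exchange can be written without inline assembler. A sketch using the GCC/Clang __atomic builtins (my variant, not the kernel's code):

    #include <stdbool.h>

    /* Atomically store 1 into *p and look at what was there before.
       Returns true if the lock was free (the old value was zero). */
    static inline bool spin_try_lock_portable(volatile int *p)
    {
        /* __atomic_exchange_n compiles to xchg on x86 and to an
           ll/sc retry loop on MIPS or ARM. */
        return __atomic_exchange_n(p, 1, __ATOMIC_ACQUIRE) == 0;
    }

The __ATOMIC_ACQUIRE ordering also keeps the compiler and the processor from moving the protected accesses above the lock acquisition.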
In general, the problem of atomic operations at the processor level is quite serious. As long as the system has only one processor, the problem does not exist. But those days are long gone: even cheap machines now have several processors, or one multi-core processor, and/or hyperthreading.
For a memory operation to be atomic, you have to guarantee that no other processor in the system will touch that memory cell. This is implemented as part of the interprocessor communication in the cache coherence machinery, by forbidding other processors to access the cell for the duration of the operation.
Interestingly, there is a completely different scheme for working with atomic objects. MIPS processors have no instructions of the form "atomically do something with a memory cell". Instead, MIPS has an interesting pair of instructions:
LL - load linked
SC - store conditional
The first instruction, load linked, simply reads a value from memory. But at the same time a flag is armed in the processor that "watches" the cell that was read: has anyone written to it?
The second, store conditional, stores a value to memory. But! If, since the load linked was executed, someone has already written a new value into that cell, the store conditional returns a failure indication (and, of course, nothing is written to the memory cell).
How it is arranged inside: apparently, the MIPS mechanism is much cheaper to implement, because it relies on the existing machinery for synchronizing cache state between processors. The flag mentioned above is cleared if the processor receives a cache state update for the address the flag is watching. That is, the "atomic" operation requires no atomicity-maintenance work at all from the neighboring processors.
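The spinlock acquisition itself maps onto this pair very naturally. A sketch in terms of the same hypothetical load_linked()/store_conditional() helpers used in the example below (store_conditional is assumed to return non-zero on success):

    int spin_try_lock_llsc(volatile int *p)
    {
        if( load_linked( p ) != 0 )
            return 0;                       /* already locked, give up at once */
        /* Succeeds only if nobody wrote to *p between the ll and the sc. */
        return store_conditional( p, 1 );
    }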
In contrast to Intel's xchg, such a pair allows you not only to build spinlocks, but also to implement more complex algorithms. Let us return to the conversation about semaphores:
rpos = atomic_add( &read_pos, 1 );
ret = buf[rpos];
read_pos %= sizeof(buf);
On a MIPS processor, all the problems with this code can be solved quite simply:
do {
    rpos = load_linked( &read_pos );
    ret = buf[rpos++];
    rpos %= sizeof(buf);
} while( !store_conditional( &read_pos, rpos ) );
Unfortunately, this code is not portable, although it is quite efficient. Still, for truly critical sections, processor-specific optimizations like this are an option.
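Today the same retry loop can be written portably as a compare-and-swap loop on C11 atomics; a sketch, assuming read_pos fits in an atomic_int (my code, not from the original article):

    #include <stdatomic.h>

    atomic_int read_pos;                 /* the shared read index          */
    char buf[256];                       /* the ring buffer being consumed */

    char ring_get(void)
    {
        int rpos, next;
        char ret;
        do {
            rpos = atomic_load( &read_pos );      /* snapshot the index */
            ret  = buf[rpos];                     /* read the element   */
            next = (rpos + 1) % sizeof(buf);      /* advance and wrap   */
            /* Retry if someone moved read_pos between the load and here. */
        } while( !atomic_compare_exchange_weak( &read_pos, &rpos, next ) );
        return ret;
    }

On MIPS and ARM the compiler lowers the compare-and-swap into exactly the kind of ll/sc loop shown above; on x86 it becomes a lock cmpxchg loop.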
What else is important to know about spinlocks in the OS kernel environment (or when programming microcontrollers, where there is usually no user mode): while holding a spinlock, interrupts must be disabled.
The point is that a spinlock can protect us from another processor, but not from our own: if an interrupt arrives while a spinlock is held, and the interrupt handler tries to acquire the same spinlock on the same processor, a deadlock results — the spinlock cannot be released until the handler returns, and the handler cannot return because it is spinning on that very spinlock.
Therefore, a typical spinlock implementation either disables interrupts itself or, at the very least, checks that the calling code has not forgotten to disable them.
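A typical shape of such an implementation, sketched with hypothetical hal_save_cli()/hal_irq_restore() helpers that disable local interrupts and return/restore the previous interrupt state (struct spinlock as in the sketch above):

    void spin_lock_irq(struct spinlock *sl, int *saved)
    {
        *saved = hal_save_cli();      /* no interrupts on this CPU from now on */
        while( ! _spin_try_lock( &(sl->lock) ) )
            while( sl->lock )
                ;
    }

    void spin_unlock_irq(struct spinlock *sl, int saved)
    {
        sl->lock = 0;
        hal_irq_restore(saved);       /* re-enable only if they were enabled   */
    }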
Returning to the question of mutexes versus spinlocks: there is very little you are allowed to do while holding a spinlock. You cannot sleep — others are waiting for you; you cannot switch context (give up the processor) — that is fraught with deadlock in the same way as interrupts are. You cannot call functions that have any chance of doing the above. For example, in the kernel or in microcontroller code it may be impossible to call malloc — as a rule it has its own synchronization and may wait for memory to be freed if memory is low. Even a function that writes to the log may be off limits, especially if it tries to send the data to a logging server via syslog over UDP.
With all that said, in interrupt handlers spinlocks can and should be used. In fact, they are the only thing that can be. True, this only makes sense if the machine has several processors; otherwise all synchronization degenerates into disabling interrupts, including the implicit disabling of interrupts while the interrupt itself is being handled.
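Which is why a build-time switch like the one sketched below (names are mine, building on the sketches above) is common: on a uniprocessor configuration the "spin" part disappears and only the interrupt masking remains:

    #if defined(CONFIG_SMP)
    #  define SPIN_LOCK_IRQ(sl, flags)    spin_lock_irq( (sl), &(flags) )
    #  define SPIN_UNLOCK_IRQ(sl, flags)  spin_unlock_irq( (sl), (flags) )
    #else   /* single processor: nobody can race us, just mask interrupts */
    #  define SPIN_LOCK_IRQ(sl, flags)    ( (flags) = hal_save_cli() )
    #  define SPIN_UNLOCK_IRQ(sl, flags)  hal_irq_restore( (flags) )
    #endif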
With that, the review of traditional synchronization primitives is complete. In a week I will try to return to the topic and finally blow your mind with an article about the problems of implementing synchronization primitives in a persistent virtual memory environment. :)
This article is part of the series "Overview of synchronization primitives".