In the last note, we discussed the most famous pair of thread synchronization tools - the mutex and the condition variable (cond). Today we will meet the semaphore - a primitive that alone can replace both of the previous two.
But first, a couple of words about spurious wakeups. (Thanks to xaizek, who reminded me of this.) In principle, strictly implemented synchronization mechanisms do not suffer from them, but nevertheless an experienced programmer never relies on that.
Let me remind you the code snippet:
while(total_free_mem <= 0) { wait_cond(&got_free_mem, &allocator_mutex); }
Here, the loop around wait_cond guarantees that even if we return from the wait by accident or by mistake, nothing terrible will happen: the check in the while condition ensures that the object has actually reached the desired state. If it has not, we go back to sleep and wait.
Note again that we check the state of the object (total_free_mem <= 0) while the mutex is locked, that is, while no one else can change it.
It is also advisable not to forget that variables through which we exchange data between threads must be marked volatile; otherwise the compiler will happily apply an optimization that breaks the logic of the code.
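For example, a flag-polling loop is the classic victim: without volatile, the compiler is entitled to read the flag once, cache it in a register, and spin forever. A minimal sketch (in modern C, C11 atomics would be the cleaner tool, but the point about the optimizer stands):

volatile int done = 0;   // set to 1 by another thread

void wait_for_done( void )
{
    // without volatile, the compiler may read done once
    // and never notice the other thread's store
    while( !done )
        ;
}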
I'll also mention in passing that the SMP implementation in Intel processors is entirely forgiving and guarantees the programmer that all processors see memory the same way, even though they have separate caches; that is, it provides so-called cache coherence in a multiprocessor system.
This is not always the case. There are processors on which special efforts must be made to guarantee that data is synchronized between processors, or caching must be explicitly disabled for the memory pages holding data that several processors can access. That is a topic for a completely separate conversation; I mention it here only for completeness.
Good. Let's return to semaphores.
For the semaphore, in fact, only two operations are defined:
sem_acquire( &sema )
sem_release( &sema )
They work very simply. Each semaphore holds an integer value.
The acquire operation blocks if the value is less than or equal to zero (the semaphore is locked) and, when the blocking ends (or immediately, if the value was greater than zero), decreases the semaphore's value by 1.
The release operation increases the semaphore's value by 1 and wakes up a thread that fell asleep in acquire, if there is one.
The semaphore can be used both as a mutex and as a cond (though not simultaneously).
In fact, here is the mutex:
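(A minimal sketch; the semaphore is assumed to be created with an initial value of 1.)

// sema starts at 1: the "mutex" is open

sem_acquire( &sema );    // enter the critical section (value 1 -> 0)
// ... work with the shared data ...
sem_release( &sema );    // leave the critical section (value 0 -> 1)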
Having performed acquire, the first thread will not fall asleep (the value was greater than zero) but will decrease the value by 1. If someone else tries to acquire before the release, that thread will fall asleep, because the value is now zero.
And here is the cond:
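(A sketch of the kind of code discussed below; the semaphore sem is assumed to start at 0, and buf, put_pos, read_pos are illustrative names for a shared ring buffer.)

static char buf[16];
static volatile int put_pos = 0, read_pos = 0;

void put_byte( char c )
{
    buf[put_pos++] = c;
    put_pos %= sizeof(buf);
    sem_release( &sem );    // one more byte is available
}

char get_byte( void )
{
    char ret;
    sem_acquire( &sem );    // sleep until at least one byte is there
    ret = buf[read_pos++];
    read_pos %= sizeof(buf);
    return ret;
}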
What do we see here? An attempt to take a byte from the buffer makes us fall asleep in acquire if no byte has been placed in the buffer yet. But once someone puts a byte in, we wake up and go pick it up.
Such code would not work with a cond instead of the semaphore. Consider the scenario: first thread A calls put_byte, then thread B calls get_byte. With the semaphore, get_byte would not block at all. With a cond, we would fall asleep waiting for it, because at the moment signal_cond was called we were not yet waiting on the cond - so put_byte woke nobody, and we went to sleep after the wakeup had already happened.
With the semaphore everything is fine: it has state, and it can be "opened in advance", even before anyone has tried to close it. That is, release before acquire is also possible!
This is great, as it really simplifies the logic.
But the code above is still wrong. If two get_byte calls happen simultaneously after two put_byte calls, they race at the point where they work with read_pos, and a nuisance can happen: the two threads will first increment the pointer and then both take the same byte.
The code is wrong in yet another way: this example has no check for buffer overflow. That, too, can be done with a semaphore, but wired "in reverse": release in get_byte, acquire in put_byte, and the initial value equal to the size of the buffer.
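(A sketch of that reverse wiring: a variant of the same pair of functions with two semaphores, bytes_avail assumed to start at 0 and free_space at sizeof(buf).)

void put_byte( char c )
{
    sem_acquire( &free_space );    // blocks while the buffer is full
    buf[put_pos++] = c;
    put_pos %= sizeof(buf);
    sem_release( &bytes_avail );   // one more byte to read
}

char get_byte( void )
{
    char ret;
    sem_acquire( &bytes_avail );   // blocks while the buffer is empty
    ret = buf[read_pos++];
    read_pos %= sizeof(buf);
    sem_release( &free_space );    // one more free slot
    return ret;
}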
Not good. Does it turn out that we need to "cover" the buffer operations with an ordinary mutex, or with a semaphore "in mutex mode"? Then what is the advantage of a semaphore over an ordinary cond?
There is one.
Firstly, there are situations when one of the parties is guaranteed to work synchronously, from a single thread. Then that side needs no mutex at all. This is often the case in a driver: the driver's buffer is filled (or, for the opposite direction, emptied) from an interrupt handler or from a single driver thread, and for that side no locking against simultaneous access to the buffer is needed.
Secondly, let's apply a bit of lockless magic:
rpos = atomic_add( &read_pos, 1 );
ret = buf[rpos];
read_pos %= sizeof(buf);
Here atomic_add atomically performs rpos = read_pos++. This guarantees that parallel execution by two threads will proceed correctly: each will get its own byte, although in an unknown order.
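(With gcc, for instance, such an atomic_add could be built from a compiler intrinsic; this sketch assumes atomic_add is meant to return the old value, that is, fetch-and-add semantics.)

unsigned atomic_add( volatile unsigned *p, unsigned v )
{
    // __sync_fetch_and_add returns the value *p held before the addition
    return __sync_fetch_and_add( p, v );
}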
True, there will be a problem with read_pos %= sizeof(buf); - strictly speaking, this operation would have to be done atomically "inside" atomic_add. That is, we would need a complete atomic read-increment-wrap operation.
Let's get into the details. An atomic and portable read-increment-wrap operation does not exist. On some processors (for example, MIPS) one can be built, but we are writing portable code.
How can the problem be fixed? And, for starters, what exactly is it? A potentially erroneous sequence for threads A and B:
- Initial value: read_pos = sizeof(buf) - 1;
- A: reads and increments, getting rpos = sizeof(buf) - 1;
- B: reads and increments, getting rpos = sizeof(buf), that is, beyond the array boundary.

Then both threads execute read_pos %= sizeof(buf);, but it is too late.
Here we are lucky. A simple adjustment is enough:
rpos = atomic_add( &read_pos, 1 ) % sizeof(buf);
In addition, the operation read_pos %= sizeof(buf); itself, as degs rightly pointed out, must also be atomic, which is hardly realistic: gcc does not offer such an operation among its built-in atomic functions.
However, this operation can simply be eliminated if the size of the array is a power of two. Once we take the remainder only of the value read by atomic_add, read_pos itself need never be touched again: let it grow and overflow from 0xFFFFFFFF to zero - the remainder of its division by the array size will always stay correct.
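Putting it together, the reader might look like this (a sketch; BUF_SIZE is assumed to be a power of two, sem and atomic_add are the primitives from above):

#define BUF_SIZE 256    // must be a power of two

static char buf[BUF_SIZE];
static volatile unsigned read_pos = 0;   // only ever grows; wraps at 2^32

char get_byte( void )
{
    unsigned rpos;

    sem_acquire( &sem );                 // wait until a byte is available
    rpos = atomic_add( &read_pos, 1 );   // old value; read_pos is never trimmed
    // 2^32 is a multiple of BUF_SIZE, so the remainder stays correct
    // even when read_pos overflows from 0xFFFFFFFF to zero
    return buf[rpos % BUF_SIZE];
}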
(For a complete brain explosion, I'll add that everything will also work with non-power-of-two sizes, if the read position and the write position are handled identically. It's just that when the counter overflows, the code will skip part of the buffer.)
That is perhaps all about semaphores.
But a couple of words about priority inversion. This problem is typical of, and critical for, real-time OSes, but since real-time priorities are available almost everywhere today, everyone needs to know about it.
It is a problem as old as the world: a stall in a theoretically correct program caused by thread priorities.
Suppose we have threads A, B, and C with priorities 3, 2, and 1 respectively; that is, A's is the highest. All priorities are in the real-time class: if A has something to do, it gets the processor unconditionally, while B and C wait.
A and C work with a shared resource: the low-priority C is a logger that takes the stream of log messages from A and writes them to the network or to a file. The operation is non-critical, which is why C has a low priority. Thread B is busy with something less important than A, but long-running. A controls a nuclear reactor and must check the power and adjust the rods every 1 ms.
In normal life, A spins in a loop: it sleeps until the end of its 1 ms slot, then takes readings and steers the rods. B counts neutrons and has work for the next 5-6 seconds. Accordingly, thread C gets no processor time at all: B and A are more important.
Sooner or later the logger's buffer overflows, and A stops on the semaphore, waiting to put its byte into the buffer.
It stops forever, or at least for 5-6 seconds: C will not get the processor until B has finished its computation. As a result, thread A misses its timeslot and the reactor is left without control.
To prevent this from happening, in a real-time OS all synchronization primitives automatically raise the priority of the thread that holds the primitive to the maximum priority among all the threads waiting for it.
That is, in our example C will receive A's priority and will push the bothersome B off the processor, so that A does not wait too long.
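In POSIX terms this is the priority-inheritance protocol, and where it is supported it can be requested on a mutex explicitly. A sketch, error handling omitted:

#include <pthread.h>

pthread_mutex_t log_mutex;

void init_log_mutex( void )
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init( &attr );
    // the holder's priority is boosted to that of the
    // highest-priority thread waiting on the mutex
    pthread_mutexattr_setprotocol( &attr, PTHREAD_PRIO_INHERIT );
    pthread_mutex_init( &log_mutex, &attr );
    pthread_mutexattr_destroy( &attr );
}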
Of course, the example is contrived, but I hope it conveys the essence.
Next time we will look at spinlocks. In the comments people are already writing that a spinlock is a small mutex, and not without reason.
But, as they say, there are nuances. :)
This article is part of the series "Overview of synchronization primitives".