In this article I want to write down everything I know about the tools that help make effective use of a system's computing resources and/or make development more convenient.
I hope it will be useful to someone who doesn't know some of this yet, and useful to me if someone fills in the gaps or points out the flaws in my knowledge.
Once upon a time processors were single-core (until 2001), and systems had only one processor (until 1966). After that, instructions began to execute truly simultaneously within a single system.
A thread is an operating-system abstraction that lets you execute pieces of code in parallel. Each thread has a "context", which includes a chunk of memory for its stack and a copy of the processor registers. Naturally, there can be many more threads running in the system than there are processors and cores, so the operating system takes care of scheduling: it restores a thread's registers into a processor core, lets the thread run for a while, then saves the registers back and lets the next one run.
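To make this less abstract, here is a minimal sketch (mine, not from the original article) of creating two threads with POSIX threads; each one gets its own stack and register context from the OS:

    #include <pthread.h>
    #include <stdio.h>

    // Each thread executes this function in parallel with the others.
    void *worker(void *arg) {
        printf("hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);  // wait for both threads to finish
        pthread_join(t2, NULL);
        return 0;
    }

(Compile with -pthread.)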
Scheduling should be fair, so that every thread gets roughly the same amount of time (while each slice stays fairly small: nobody will be happy if every thread runs for whole seconds at a stretch), and the scheduling itself should take as little time as possible. Here we hit the traditional latency/throughput trade-off: if we let threads run for very short slices, latency is minimal, but the ratio of useful work to context-switch overhead drops; with long slices more useful work gets done overall, but the delays become noticeable, which can show up in the user interface or in I/O.
One more important thing, which we will come back to later: threads can be put to sleep. We can ask the operating system to suspend a thread for some time. The OS can then take this into account during scheduling by excluding the thread from the run queue, which frees up resources for other threads or reduces power consumption. In addition, priorities can be set so that some threads run longer or more often than others.
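For instance, on POSIX systems a thread can be put to sleep like this (a tiny sketch of mine, not from the original):

    #include <time.h>

    int main(void) {
        // Ask the OS to suspend this thread for ~100 ms; the scheduler
        // drops it from the run queue until the timeout expires.
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
        nanosleep(&ts, NULL);
        return 0;
    }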
So we ended up with many cores, but memory stayed a single shared resource (although you can break a leg in modern memory organization: there is physical and virtual memory, pages move back and forth, things get cached here and there and checked for errors; it's a wonder all of it works at all).
And once we start modifying the same memory from several threads, things can go very differently from what we intended. There can be "races", where the result depends on the order in which operations happen to execute, which can lead to lost data. Even worse, in some cases the data can be corrupted outright. For example, if we write a value to an unaligned memory address, the write has to touch two memory cells, and it can happen that one thread writes one cell while another thread writes the other. If that value was a pointer or an array index, you will most likely get a crash some time later, when the program tries to access someone else's memory.
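The classic way to see such a race (a sketch of mine, not from the article): two threads increment a shared counter without any synchronization. counter++ is a read-modify-write, not a single atomic step, so updates get lost:

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;  // shared and unprotected

    void *increment(void *arg) {
        for (int i = 0; i < 1000000; i++)
            counter++;  // two threads can interleave between the read and the write
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld\n", counter);  // almost always prints less than 2000000
        return 0;
    }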
To prevent this, synchronization primitives are used: they give guarantees that certain blocks of code execute in a strict order relative to certain other blocks of code.
The simplest primitive: a boolean variable where true means the lock is taken by someone and false means it is free, plus two methods, Lock and Unlock. Unlock sets the value to false. Lock does a TAS or a CAS (more on those later) in a loop to atomically change false to true, retrying until it succeeds.
With such a thing we can build blocks of code that execute exclusively (i.e. while one is executing, another cannot start). At the start of a block we call Lock, at the end Unlock. It is important that a thread does not call Lock a second time, otherwise there will be no one to call Unlock and we get a deadlock.
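For example, the lost updates from the counter sketch above disappear once the increment is wrapped in such an exclusive block; here with a POSIX mutex (only the changed worker is shown, the rest stays the same):

    #include <pthread.h>

    long counter = 0;
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&m);    // Lock: only one thread at a time gets past this
            counter++;                 // the exclusive block
            pthread_mutex_unlock(&m);  // Unlock: let the next thread in
        }
        return NULL;
    }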
We can add extras on top. For example, store the identifier of the thread that took the lock, so that no one except the owning thread can call Unlock, and so that a second Lock from the same thread does not loop forever; that is roughly what a "mutex" implementation is. And if instead of a boolean we keep an unsigned counter, and Lock waits for the counter to be greater than zero and then decrements it, we get a "semaphore", which allows some specified number of blocks to run simultaneously, but no more.
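POSIX exposes exactly such a counting semaphore; a small sketch of mine (the limit of 4 concurrent threads is an arbitrary example):

    #include <semaphore.h>

    sem_t sem;

    void init(void) {
        sem_init(&sem, 0, 4);  // the counter starts at 4: up to 4 threads may enter
    }

    void limited_section(void) {
        sem_wait(&sem);   // "Lock": waits until the counter is above zero, then decrements it
        /* ... at most 4 threads are in here at the same time ... */
        sem_post(&sem);   // "Unlock": increments the counter back
    }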
Since the lock spins in an infinite loop, it simply "burns" resources without doing anything useful. There are implementations that, on failure, tell the scheduler "I'm done, give the rest of my time slice to other threads". Or, for example, there are "adaptive mutexes": when trying to acquire a lock held by a thread that is currently running, they spin, and otherwise they pass execution to other threads. This makes sense: if the thread that can release the resource is running right now, there is a good chance the lock is about to be freed; and if it is not running, there is no point in spinning, so it is better to hand execution over to another thread, maybe even the one we are waiting for.
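The "pass the rest of my time to other threads" trick can be sketched like this (my simplification, not a production lock), using C11 atomic_flag and POSIX sched_yield:

    #include <sched.h>
    #include <stdatomic.h>

    atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock_yielding(void) {
        // If the lock is busy, don't burn the rest of our time slice:
        // ask the scheduler to run someone else and retry later.
        while (atomic_flag_test_and_set(&flag))
            sched_yield();
    }

    void unlock_yielding(void) {
        atomic_flag_clear(&flag);
    }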
If possible, you should avoid spinlocks. They are only appropriate when the blocks of code they protect execute so quickly that the chance of contention is extremely small. Otherwise we can spend quite a lot of resources just heating the room.
A better alternative to spinning is to put the thread to sleep, so that the scheduler excludes it from the contenders for CPU time, and then have another thread wake the first one up when the new data is ready to be processed. But this only holds for fairly long waits: if locks are released within roughly a single "tick" of a thread's execution, spinlocks are more efficient.
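This sleep-and-wake pattern is what condition variables give you; a minimal POSIX sketch of mine (the ready flag and the function names are made up for the example):

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    int ready = 0;  // "new data is in place", protected by m

    void consumer(void) {
        pthread_mutex_lock(&m);
        while (!ready)                   // re-check: wakeups can be spurious
            pthread_cond_wait(&cv, &m);  // sleeps; the scheduler skips this thread
        /* ... process the data ... */
        pthread_mutex_unlock(&m);
    }

    void producer(void) {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&cv);        // wake the sleeping consumer up
        pthread_mutex_unlock(&m);
    }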
In the pursuit of speed, processor architects have created a lot of dark magic that tries to predict the future and can reorder processor instructions when they do not depend on each other. If each processor works only with its own memory, you will never notice that the instructions a = 5 and b = 8 executed in a different order than you wrote them in the code.
But with parallel execution this can lead to interesting situations. An example from Wikipedia:
    // Processor #1:
    while (f == 0);
    // Memory fence required here
    print(x);
    // Processor #2:
    x = 42;
    // Memory fence required here
    f = 1;
What could possibly go wrong? The loop ends once f is assigned one, and by that time x would seem to already be 42. But no: the processor may reorder such instructions arbitrarily, so f = 1 can execute first and x = 42 after it, and something other than 42 can be printed. And there is more: even if the write order is fine, the read order can be changed, with the value of x read before we even enter the loop.
To counteract this, memory barriers are used: special instructions that prevent other instructions from being reordered across them. On Habré there is an excellent translation on this topic.
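In C11 the fences from the example above can be written out explicitly. A sketch of mine (x and f are made atomic with relaxed accesses, because with plain variables this would also be a data race; the ordering comes from the fences):

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int x = 0, f = 0;

    void processor2(void) {  // the writer
        atomic_store_explicit(&x, 42, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);  // keeps the write to x before the write to f
        atomic_store_explicit(&f, 1, memory_order_relaxed);
    }

    void processor1(void) {  // the reader
        while (atomic_load_explicit(&f, memory_order_relaxed) == 0)
            ;
        atomic_thread_fence(memory_order_acquire);  // keeps the read of x after the read of f
        printf("%d\n", atomic_load_explicit(&x, memory_order_relaxed));  // always 42
    }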
Since some operations on memory cannot be made thread-safe as ordinary code (for example, adding a number to another and storing the result back), there are atomic operations, backed by special processor instructions.
The most commonly used are test-and-set (TAS), which atomically writes a value and returns the previous one, and compare-and-swap (CAS), which atomically replaces a value only if it equals an expected one.
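In C11 these live in <stdatomic.h>. For example, the "add a number and store it back" case from above becomes one indivisible operation (a tiny sketch of mine):

    #include <stdatomic.h>

    atomic_long total = 0;

    void add(long n) {
        // A single atomic read-modify-write: no other thread can
        // squeeze in between the read and the write.
        atomic_fetch_add(&total, n);
    }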
Returning to our spinlock, the lock-acquisition operation can be implemented either with TAS:
    void lock() {
        // test_and_set atomically writes 1 and returns the previous value;
        // spin while the previous value was 1, i.e. someone else held the lock
        while (test_and_set(&isLocked, 1) == 1);
    }
Or with CAS:
    void lock() {
        // atomically replace 0 with 1; fails if the current value is not 0
        while (!compare_and_swap(&isLocked, 0, 1));
    }
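And a version that actually compiles today, using C11's compare-exchange (a sketch of mine, not the author's code):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool isLocked = false;

    void lock(void) {
        bool expected = false;
        // CAS: replace false with true only if the lock is currently free.
        // On failure 'expected' is overwritten with the current value,
        // so it has to be reset before retrying.
        while (!atomic_compare_exchange_weak(&isLocked, &expected, true))
            expected = false;
    }

    void unlock(void) {
        atomic_store(&isLocked, false);
    }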
...as I said, I wanted to document everything, but it turned out to be quite a lot, so I will break this up into three articles. Stay tuned.
UPD: The second part.
Source: https://habr.com/ru/post/318374/