This article is a partial translation of the comprehensive guide by David Howells and Paul E. McKenney distributed as part of the Linux documentation (Documentation/memory-barriers.txt,
online version).
It is a must-read for kernel and driver developers, and very informative for application programmers.
A generic memory access model.
Consider the following system model:
            :                :
+-------+   :   +--------+   :   +-------+
|       |   :   |        |   :   |       |
|       |   :   |        |   :   |       |
| CPU 1 |<----->| Memory |<----->| CPU 2 |
|       |   :   |        |   :   |       |
|       |   :   |        |   :   |       |
+-------+   :   +--------+   :   +-------+
    ^       :       ^        :       ^
    |       :       |        :       |
    |       :       |        :       |
    |       :       v        :       |
    |       :   +--------+   :       |
    |       :   |        |   :       |
    |       :   |        |   :       |
    +---------->| Device |<----------+
            :   |        |   :
            :   |        |   :
            :   +--------+   :
            :                :
Each processor executes a program that generates a sequence of memory access operations. An abstract processor has a weak memory access ordering model, so that, in fact, access operations can be performed in any order, if this does not violate the causal relationships inherent in the executable code. Similarly, the compiler is free to organize the generated code, provided that the result of the execution of this code will correspond to the source code of the program.
In the figure above, the memory operations performed by each processor are observed by the other components of the system as operations crossing the boundary between that processor and the rest of the system (indicated by the dotted lines).
Operations with external devices.
Many devices have a control interface in the form of a set of registers at specific memory addresses; the order of accesses to the control registers is important for correct operation. For example, a network adapter may have a set of internal registers that are accessed through an address register (A) and a data register (D). The following code can be used to read internal register 5:
*A = 5; x = *D;
This code can generate the following memory access sequences:
1) *A = 5, x = *D
2) x = *D, *A = 5
The second sequence will almost certainly yield an incorrect result, since the address register would only be set after the data register has been read.
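As a rough sketch of why this ordering matters, here is a user-space simulation of such an adapter (the names, register layout, and values are all invented; a real driver would use the kernel's MMIO accessors and a mandatory write barrier such as wmb()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy simulation of the adapter from the text: the address register (A)
 * selects which internal register the data register (D) exposes. All
 * names and register values here are invented for illustration. */
static uint32_t internal_regs[8] = {0, 11, 22, 33, 44, 55, 66, 77};
static volatile uint32_t A; /* address register */
static volatile uint32_t D; /* data register    */

/* Stand-in for the hardware reacting to the address write. */
static void device_update(void) { D = internal_regs[A]; }

/* Read internal register n. The fence plays the role of the write
 * barrier a real driver would need between *A = n and x = *D, so the
 * address write cannot be reordered after the data read. */
uint32_t read_internal_reg(uint32_t n)
{
    A = n;
    atomic_thread_fence(memory_order_seq_cst);
    device_update();
    return D;
}
```

Here a C11 sequentially consistent fence stands in for the hardware write barrier; without some ordering constraint, the compiler or the CPU would be free to move the address write past the data read.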
Guarantees.
The processor guarantees at least the following:
- Memory access operations that depend on one another are performed in the correct order within the same processor. In other words, for code
Q = P; D = *Q;
the processor will generate the following memory access operations:
Q = P, D = *Q
and only in that order.
- Reads and writes to overlapping areas of memory performed by a single processor appear ordered within that processor. In other words, for the code
a = *X; *X = b;
the processor will generate only the following memory access sequence:
a = *X, *X = b
And for the code
*X = c; d = *X;
- only this:
*X = c, d = *X
Besides:
- It should not be assumed that independent reads and writes will be performed in any particular order. In other words, for the code
X = *A; Y = *B; *D = Z;
one of the following sequences can be generated:
1) X = *A, Y = *B, *D = Z
2) X = *A, *D = Z, Y = *B
3) Y = *B, X = *A, *D = Z
4) Y = *B, *D = Z, X = *A
5) *D = Z, X = *A, Y = *B
6) *D = Z, Y = *B, X = *A
- It should be assumed that adjacent and overlapping memory accesses may be merged or discarded altogether. In other words, for the code
X = *A; Y = *(A + 4);
one of the following sequences can be generated:
1) X = *A; Y = *(A + 4);
2) Y = *(A + 4); X = *A;
3) {X, Y} = {*A, *(A + 4)};
And for the code
*A = X; Y = *A;
- one of the following:
1) *A = X; Y = *A;
2) *A = Y = X;
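The positive guarantees above can be illustrated with a minimal C sketch (the function names are invented); within a single processor, neither pattern needs any barrier:

```c
#include <assert.h>

/* The dependent-reads guarantee: Q = P; D = *Q always loads the pointer
 * before dereferencing it, so no barrier is needed within one CPU. */
int load_through(int **P)
{
    int *Q = *P;   /* Q = P  (load the pointer)         */
    return *Q;     /* D = *Q (load through the pointer) */
}

/* The overlapping-accesses guarantee: within one CPU, a = *X; *X = b
 * always returns the value that was in *X before the store. */
int fetch_then_store(int *X, int b)
{
    int a = *X;
    *X = b;
    return a;
}
```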
What are memory access barriers?
As shown above, independent memory accesses may be performed in an arbitrary order, which is a problem for interaction between processors or between a processor and an external device. Some mechanism is needed to tell the compiler and the processor that ordering is required.
Memory access barriers are such a mechanism. They impose a partial ordering on the memory accesses on either side of the barrier.
Processors and other system components use a variety of tricks to improve performance, including reordering, deferring and merging memory accesses, speculative loads, branch prediction, and various kinds of caching. Memory access barriers serve to suppress these mechanisms.
Varieties of barriers.
There are 4 main types of barriers:
- Write barriers.
A write barrier ensures that all memory writes issued by instructions preceding the barrier will appear, from the point of view of the other system components, to be performed before any memory write issued by instructions following the barrier.
Write barriers order only writes to memory; it should not be assumed that they have any effect on read operations.
Note: write barriers should normally be paired with read barriers or data dependency barriers; see the section "Pairing barriers in the case of SMP".
- Data dependency barriers.
A data dependency barrier is a weaker form of a read barrier. When two reads are performed such that the second depends on the result of the first (for example, the first read obtains the address from which the second will read), a data dependency barrier ensures that the data at the address used by the second read will be up to date by the time that read is performed.
Data dependency barriers affect only interdependent reads; it should not be assumed that they have any effect on writes, or on independent or overlapping reads.
Other processors generate sequences of memory writes whose results the processor in question can observe. A data dependency barrier guarantees that if a read preceding the barrier returned a value written by another processor, then after the barrier the results of all writes that preceded that write will also be visible to the processor.
Note: the dependency between the reads must be a true data dependency; if the address of the second read depends on the result of the first not directly but through, for example, a conditional branch, then this is a control dependency, and a read barrier or a general memory barrier should be used instead. See the section "Control dependency barriers".
Note: data dependency barriers should be paired with write barriers; see the section "Pairing barriers in the case of SMP".
- Read barriers.
A read barrier is a data dependency barrier plus a guarantee that all reads issued by instructions preceding the barrier will appear to be performed before any read issued by instructions following the barrier, from the point of view of the other system components.
Read barriers order only reads from memory; it should not be assumed that they have any effect on write operations.
Note: read barriers should be paired with write barriers; see the section "Pairing barriers in the case of SMP".
- General memory barriers.
A general memory barrier ensures that all reads and writes issued by instructions preceding the barrier will appear to be performed before any read or write issued by instructions following the barrier, from the point of view of the rest of the system.
General memory barriers order both reads and writes.
A general memory barrier implies both a read barrier and a write barrier and can be used in place of either.
And a couple of implicit types of barriers:
- Lock acquisition (LOCK) operations.
Such operations act as a one-way permeable barrier. They ensure that all memory accesses issued by instructions following the lock acquisition will appear to be performed after the lock is acquired, from the point of view of the other system components.
Memory accesses issued by instructions preceding a lock acquisition may be performed either before or after it.
A lock acquisition should almost always be paired with a lock release.
- Lock release (UNLOCK) operations.
Such operations also act as a one-way permeable barrier. They ensure that all memory accesses issued by instructions preceding a lock release will appear to be performed before the lock is released, from the point of view of the other system components.
Memory accesses issued by instructions following a lock release may be performed either before or after it.
Lock acquisition and release operations are guaranteed not to be reordered with respect to one another.
Using lock acquisition and release operations generally eliminates the need for other types of memory barriers.
Memory access barriers are needed only for interaction between processors or between a processor and an external device; code that does not take part in such interaction does not need them.
Note: the guarantees stated here are minimal. A particular processor architecture may provide stronger guarantees, but they should not be relied upon in architecture-independent code.
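As a user-space illustration of how acquisition/release ordering makes explicit barriers unnecessary, here is a minimal sketch using POSIX mutexes (the function names and iteration count are invented):

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared counter: the acquire/release semantics of the
 * mutex provide all the ordering the two threads need, so no explicit
 * memory barrier appears anywhere in the code. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

static void *worker(void *arg)
{
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&lock);   /* one-way barrier: later accesses stay after  */
        counter++;
        pthread_mutex_unlock(&lock); /* one-way barrier: earlier accesses stay before */
    }
    return NULL;
}

int run_counter_demo(void)
{
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Each critical section is bracketed by the one-way barriers of lock and unlock, so the increments from the two threads never interleave destructively.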
What barriers do not guarantee.
- There is no guarantee that memory accesses issued by instructions preceding the barrier will have completed by the time the barrier itself completes; the barrier can be thought of as drawing a line in the processor's request queue that requests of the corresponding type may not cross.
- There is no guarantee that a memory barrier issued on one processor will have any direct effect on another processor or any other device in the system. The indirect effect is the order in which other devices observe this processor's memory accesses (but see the next item).
- There is no guarantee that a processor will observe the effects of another processor's memory accesses in the correct order, even if the other processor uses memory barriers, unless the first processor uses the matching barriers itself (see the section "Pairing barriers in the case of SMP").
- There is no guarantee that some intermediate hardware will not reorder memory accesses. The processor cache coherence mechanisms should propagate the indirect effects of barriers between processors, but may not do so in the original order.
Data dependency barriers.
The model of using data dependency barriers has a number of subtleties, the need for which is not always obvious.
Consider the following sequence of events:
CPU 1           CPU 2
=============== ===============
{ A == 1, B == 2, C == 3, P == &A, Q == &C }
B = 4;
<write barrier>
P = &B
                Q = P;
                D = *Q;
There is an explicit data dependency in this code, and it would seem that at the end of the sequence Q can be either &A or &B, with (Q == &A) implying (D == 1) and (Q == &B) implying (D == 4).
However, from the point of view of CPU 2, P may be updated before B is, which leads to (Q == &B) and (D == 2) (o_O).
Although it may seem that such behavior is a violation of cause-effect relationships, or a problem of coherence, it is not. This behavior is observed on some types of processors (for example, DEC Alpha).
To correct the situation, a data dependency barrier (or something stronger) is required between the read of the address and the read of the data at that address:
CPU 1           CPU 2
=============== ===============
{ A == 1, B == 2, C == 3, P == &A, Q == &C }
B = 4;
<write barrier>
P = &B
                Q = P;
                <data dependency barrier>
                D = *Q;
In this case, the third outcome ((Q == &B) and (D == 2)) is impossible.
Note: this counterintuitive situation is most easily reproduced on machines with a split cache, where, for example, one cache bank serves even-numbered cache lines and the other odd-numbered ones. If P and B fall into cache lines of different parity, uneven load on the cache banks can produce the described effect.
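A user-space analogue of this example can be sketched with C11 atomics (the function names are invented): memory_order_release stands in for CPU 1's write barrier, and memory_order_consume (which current compilers promote to acquire) stands in for CPU 2's data dependency barrier:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int B;
static _Atomic(int *) P;

static void *cpu1(void *arg)
{
    B = 4;                                               /* B = 4            */
    atomic_store_explicit(&P, &B, memory_order_release); /* barrier; P = &B  */
    return NULL;
}

static void *cpu2(void *arg)
{
    int *Q;
    do {                                                 /* Q = P            */
        Q = atomic_load_explicit(&P, memory_order_consume);
    } while (Q == NULL);                                 /* wait for P       */
    /* the consume load supplies the data dependency barrier */
    *(int *)arg = *Q;                                    /* D = *Q           */
    return NULL;
}

int run_publication_demo(void)
{
    int D = 0;
    B = 0;
    atomic_store(&P, NULL);
    pthread_t t1, t2;
    pthread_create(&t2, NULL, cpu2, &D);
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return D;  /* the (Q == &B && D == old B) outcome is excluded */
}
```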
Control dependency barriers.
A control dependency requires a full read barrier; a data dependency barrier is not enough. Consider the following code:
q = &a;
if (p)
        q = &b;
<data dependency barrier>
x = *q;
The barrier in this example does not have the desired effect, because there is no data dependency between (p) and (x = *q), only a control dependency, and the processor may short-circuit it by predicting the outcome of the branch.
What is really required in this situation:
q = &a;
if (p)
        q = &b;
<read barrier>
x = *q;
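A C11 sketch of the same idea (the names are invented): an acquire fence plays the role of the read barrier placed between the branch on p and the dependent read:

```c
#include <assert.h>
#include <stdatomic.h>

/* The branch on p is a control dependency, not a data dependency, so a
 * data dependency barrier would NOT be enough here; an acquire fence,
 * the analogue of the read barrier the text calls for, is placed
 * between the branch and the dependent read. */
static int a = 1, b = 2;

int read_selected(_Atomic int *p)
{
    int *q = &a;
    if (atomic_load_explicit(p, memory_order_relaxed))
        q = &b;
    atomic_thread_fence(memory_order_acquire); /* <read barrier> */
    return *q;                                 /* x = *q */
}
```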
Pairing barriers in the case of SMP.
When organizing interprocessor interaction, certain types of barriers should always be used in pairs; a missing pair is almost certainly a bug.
A write barrier should always be paired with a data dependency barrier, a read barrier, or a general memory barrier. Similarly, a data dependency barrier or a read barrier should be paired with a write barrier or a general memory barrier:
CPU 1           CPU 2
=============== ===============
a = 1;
<write barrier>
b = 2;          x = b;
                <read barrier>
                y = a;
or:
CPU 1           CPU 2
=============== ===============================
a = 1;
<write barrier>
b = &a;         x = b;
                <data dependency barrier>
                y = *x;
A read barrier of some kind must always be present; the only question is which strength of primitive to choose for it.
Note: the writes before a write barrier are normally paired with the reads after the read barrier or data dependency barrier, and vice versa:
CPU 1                         CPU 2
===============               ===============
a = 1;          }----   --->{ v = c
b = 2;          }    \ /    { w = d
<write barrier>       \       <read barrier>
c = 3;          }    / \    { x = a;
d = 4;          }----   --->{ y = b;
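This pairing pattern can be sketched in user space with C11 fences (the function names are invented): a release fence stands in for the write barrier and an acquire fence for the read barrier:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static _Atomic int a, b;

static void *writer(void *arg)
{
    atomic_store_explicit(&a, 1, memory_order_relaxed); /* a = 1           */
    atomic_thread_fence(memory_order_release);          /* <write barrier> */
    atomic_store_explicit(&b, 2, memory_order_relaxed); /* b = 2           */
    return NULL;
}

static void *reader(void *arg)
{
    int x, y;
    do {                                                /* x = b           */
        x = atomic_load_explicit(&b, memory_order_relaxed);
    } while (x != 2);
    atomic_thread_fence(memory_order_acquire);          /* <read barrier>  */
    y = atomic_load_explicit(&a, memory_order_relaxed); /* y = a           */
    *(int *)arg = y;
    return NULL;
}

int run_pairing_demo(void)
{
    int y = 0;
    atomic_store(&a, 0);
    atomic_store(&b, 0);
    pthread_t t1, t2;
    pthread_create(&t2, NULL, reader, &y);
    pthread_create(&t1, NULL, writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return y;  /* once x == 2 has been observed, y == 1 is guaranteed */
}
```

Because the reader waits until it sees b == 2, the paired fences guarantee it then sees a == 1; dropping either fence would make the outcome y == 0 possible on weakly ordered hardware.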
Examples of memory access operations with barriers.
First, write barriers introduce a partial order on write operations. Consider the following sequence of events:
CPU 1
=======================
A = 1
B = 2
C = 3
<write barrier>
D = 4
E = 5
This sequence reaches the memory coherence system in an order that the rest of the system may observe as the unordered set of writes {WRITE A, WRITE B, WRITE C}, all occurring before the unordered set {WRITE D, WRITE E}:
+-------+       :      :
|       |       +------+
|       |------>| C=3  |     }     /\
|       |  :    +------+     }-----  \  -----> Events perceptible to
|       |  :    | A=1  |     }        \/       the rest of the system
|       |  :    +------+     }
| CPU 1 |  :    | B=2  |     }
|       |       +------+     }
|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
|       |       +------+     }        requires all writes before the
|       |  :    | E=5  |     }        barrier to be committed before
|       |  :    +------+     }        further writes may take place
|       |------>| D=4  |     }
|       |       +------+
+-------+       :      :
                   |
                   | Sequence in which writes are committed to the
                   | memory system by CPU 1
                   V
Second, data dependency barriers introduce a partial order on data-dependent read operations. Consider the following sequence of events:
CPU 1                   CPU 2
======================= =======================
{ B = 7; X = 9; Y = 8; C = &Y }
A = 1
B = 2
<write barrier>
C = &B                  read X
D = 4                   read C (gets &B)
                        read *C (reads B)
Without barriers of its own, CPU 2 may observe the events produced by CPU 1 in an arbitrary order, despite CPU 1's write barrier:
+-------+       :      :                :       :
|       |       +------+                +-------+  | Sequence of
|       |------>| B=2  |-----       --->| Y->8  |  | perceived updates
|       |  :    +------+     \          +-------+  | on CPU 2
| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
|       |       +------+       |        +-------+
|       |   wwwwwwwwwwwwwwww   |        :       :
|       |       +------+       |        :       :
|       |  :    | C=&B |---    |        :       :       +-------+
|       |  :    +------+   \   |        +-------+       |       |
|       |------>| D=4  |    ----------->| C->&B |------>|       |
|       |       +------+       |        +-------+       |       |
+-------+       :      :       |        :       :       |       |
                               |        :       :       |       |
                               |        :       :       | CPU 2 |
                               |        +-------+       |       |
   Apparently incorrect --->   |        | B->7  |------>|       |
   perception of B (!)         |        +-------+       |       |
                               |        :       :       |       |
                               |        +-------+       |       |
   The read of X holds --->     \       | X->9  |------>|       |
   up the update of B            \      +-------+       |       |
                                  ----->| B->2  |       +-------+
                                        +-------+
                                        :       :
In the example above, CPU 2 observes B == 7, even though the read of *C (which should return B) comes after the read of C.
However, if a data dependency barrier is placed between the read of C and the read of *C (i.e. B) on CPU 2:
CPU 1                   CPU 2
======================= =======================
{ B = 7; X = 9; Y = 8; C = &Y }
A = 1
B = 2
<write barrier>
C = &B                  read X
D = 4                   read C (gets &B)
                        <data dependency barrier>
                        read *C (reads B)
The picture looks like this:
+-------+       :      :                :       :
|       |       +------+                +-------+
|       |------>| B=2  |-----       --->| Y->8  |
|       |  :    +------+     \          +-------+
| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
|       |       +------+       |        +-------+
|       |   wwwwwwwwwwwwwwww   |        :       :
|       |       +------+       |        :       :
|       |  :    | C=&B |---    |        :       :       +-------+
|       |  :    +------+   \   |        +-------+       |       |
|       |------>| D=4  |    ----------->| C->&B |------>|       |
|       |       +------+       |        +-------+       |       |
+-------+       :      :       |        :       :       |       |
                               |        :       :       |       |
                               |        :       :       | CPU 2 |
                               |        +-------+       |       |
                               |        | X->9  |------>|       |
                               |        +-------+       |       |
  Makes sure all effects --->   \   ddddddddddddddddd   |       |
  prior to the write of C        \      +-------+       |       |
  are perceptible to              ----->| B->2  |------>|       |
  subsequent reads                      +-------+       |       |
                                        :       :       +-------+
Third, read barriers introduce a partial order on read operations. Consider the following sequence of events:
CPU 1                   CPU 2
======================= =======================
{ A = 0, B = 9 }
A = 1
<write barrier>
B = 2
                        read B
                        read A
Without barriers of its own, CPU 2 may observe the events produced by CPU 1 in an arbitrary order, despite CPU 1's write barrier:
+-------+       :      :                :       :
|       |       +------+                +-------+
|       |------>| A=1  |------      --->| A->0  |
|       |       +------+      \         +-------+
| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
|       |       +------+        |       +-------+
|       |------>| B=2  |---     |       :       :
|       |       +------+   \    |       :       :       +-------+
+-------+       :      :    \   |       :       :       |       |
                             ---------->| B->2  |------>|       |
                                |       +-------+       | CPU 2 |
                                |       | A->0  |------>|       |
                                |       +-------+       |       |
                                |       :       :       +-------+
                                 \      :       :
                                  \     +-------+
                                   ---->| A->1  |
                                        +-------+
                                        :       :
However, if a read barrier is placed between the read of B and the read of A on CPU 2:
CPU 1                   CPU 2
======================= =======================
{ A = 0, B = 9 }
A = 1
<write barrier>
B = 2
                        read B
                        <read barrier>
                        read A
the partial order imposed by CPU 1 will be observed correctly by CPU 2:
+-------+       :      :                :       :
|       |       +------+                +-------+
|       |------>| A=1  |------      --->| A->0  |
|       |       +------+      \         +-------+
| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
|       |       +------+        |       +-------+
|       |------>| B=2  |---     |       :       :
|       |       +------+   \    |       :       :       +-------+
+-------+       :      :    \   |       :       :       |       |
                             ---------->| B->2  |------>|       |
                                |       +-------+       | CPU 2 |
                                |       :       :       |       |
                                |       :       :       |       |
  At this point the read --->    \  rrrrrrrrrrrrrrrrr   |       |
  barrier causes all effects      \     +-------+       |       |
  prior to the write of B          ---->| A->1  |------>|       |
  to be perceptible to CPU 2            +-------+       |       |
                                        :       :       +-------+
Read barriers and speculative loads.
Many processor architectures can load from memory speculatively: if the processor sees in its pipeline a load from memory and the external bus is not needed for other instructions, it can issue the load early, even though instruction execution has not yet reached the load. This allows the load instruction to complete very quickly, because by the time it executes the processor may already have the value.
It may turn out that the loaded value is not needed, for example if a branch bypasses the load. In such cases the speculatively read value is discarded or kept in the cache for later use.
Consider the following example:
CPU 1                   CPU 2
======================= =======================
                        read B
                        DIVIDE     } division instructions generally
                        DIVIDE     } take a long time to perform
                        read A
which can generate the following commands:
                                        :       :       +-------+
                                        +-------+       |       |
                                    --->| B->2  |------>|       |
                                        +-------+       | CPU 2 |
                                        :       :DIVIDE |       |
                                        +-------+       |       |
  While busy with the division  --->--->| A->0  |~~~~   |       |
  the CPU speculates on the             +-------+   ~   |       |
  read of A                             :       :   ~   |       |
                                        :       :DIVIDE |       |
                                        :       :   ~   |       |
  Once the divisions complete -->       :       :   ~-->|       |
  the CPU can deliver the               :       :       |       |
  read result immediately               :       :       +-------+
Placing a read barrier or a data dependency barrier before the second read operation:
CPU 1                   CPU 2
======================= =======================
                        read B
                        <read barrier>
                        read A
will force the processor to recheck the speculatively read value, in a way that depends on the type of barrier used. If the speculated memory location has not changed, the value will simply be used:
                                        :       :       +-------+
                                        +-------+       |       |
                                    --->| B->2  |------>|       |
                                        +-------+       | CPU 2 |
                                        :       :DIVIDE |       |
                                        +-------+       |       |
  While busy with the division  --->--->| A->0  |~~~~   |       |
  the CPU speculates on the             +-------+   ~   |       |
  read of A                             :       :   ~   |       |
                                        :       :DIVIDE |       |
                                        :       :   ~   |       |
                                        :       :   ~   |       |
                                    rrrrrrrrrrrrrrrr~   |       |
                                        :       :   ~   |       |
                                        :       :   ~-->|       |
                                        :       :       |       |
                                        :       :       +-------+
However, if the location was overwritten by another processor in the meantime, the speculatively read value is discarded and the read is reissued:
                                        :       :       +-------+
                                        +-------+       |       |
                                    --->| B->2  |------>|       |
                                        +-------+       | CPU 2 |
                                        :       :DIVIDE |       |
                                        +-------+       |       |
  While busy with the division  --->--->| A->0  |~~~~   |       |
  the CPU speculates on the             +-------+   ~   |       |
  read of A                             :       :   ~   |       |
                                        :       :DIVIDE |       |
                                        :       :   ~   |       |
                                        :       :   ~   |       |
                                    rrrrrrrrrrrrrrrrr   |       |
  The speculation is discarded  --->    +-------+       |       |
  and an updated value is           --->| A->1  |------>|       |
  retrieved                             +-------+       |       |
                                        :       :       +-------+
Explicit kernel barriers.
Linux has the following kinds of barriers, operating at different levels:
- Compiler barriers.
- CPU barriers.
- Write barriers for memory-mapped I/O areas (MMIO).
Compiler barriers.
The explicit compiler barrier prevents the compiler from moving memory accesses from one side of the barrier to the other:
barrier()
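For GCC-compatible compilers the kernel defines barrier() as an empty asm statement with a "memory" clobber; the small sketch below (ordered_update is an invented name) shows the definition and a typical use:

```c
#include <assert.h>

/* barrier() as defined for GCC-compatible compilers in the kernel: an
 * empty asm statement with a "memory" clobber, which forbids the
 * compiler from caching memory values across it or moving memory
 * accesses to the other side of it. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Illustration only: without the barrier the compiler would be free to
 * drop the first store as dead; with it, both stores must be emitted
 * in order. */
int ordered_update(int *p)
{
    *p = 1;
    barrier();
    *p = 2;
    return *p;
}
```

Note that this constrains only the compiler; the processor may still reorder the stores, which is what the CPU barriers are for.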
This is the only compiler-level barrier. It has no effect on the behavior of the processor, which may still reorder memory accesses at its discretion.
CPU barriers.
The following eight functions are processor-level barriers in Linux:
TYPE              MANDATORY               SMP
================  ======================  ==========================
general           mb()                    smp_mb()
write             wmb()                   smp_wmb()
read              rmb()                   smp_rmb()
data dependency   read_barrier_depends()  smp_read_barrier_depends()
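As a rough user-space illustration of the table, the sketch below maps the mandatory barriers onto C11 fences (the my_ prefix marks these as invented analogues; the kernel's real implementations are architecture-specific and may be stronger):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space analogues of the kernel barrier functions,
 * expressed with C11 fences. */
#define my_mb()  atomic_thread_fence(memory_order_seq_cst) /* mb()  */
#define my_wmb() atomic_thread_fence(memory_order_release) /* wmb() */
#define my_rmb() atomic_thread_fence(memory_order_acquire) /* rmb() */
/* read_barrier_depends() is a no-op everywhere except DEC Alpha,
 * where a real instruction is required. */
#define my_read_barrier_depends() ((void)0)

/* Smoke test: the fences compile and do not disturb ordinary code. */
int fence_smoke_test(void)
{
    int x = 1;
    my_wmb();
    int y = x + 1;
    my_rmb();
    my_mb();
    my_read_barrier_depends();
    return y;
}
```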
All processor-level barriers, except the data dependency barrier, include a compiler barrier. Note: in the case of a data dependency, one would expect the compiler to issue the reads in the correct order (for example, for a[b], to load the value of b first and only then a[b]), but the C standard gives no such guarantee.
When a uniprocessor kernel is compiled, the SMP versions of the barriers reduce to compiler barriers, because a single processor always observes its own memory operations in the correct order.
Note: SMP barriers should be used to control the ordering of accesses to shared memory in SMP systems, although using locking primitives is sufficient in most cases.
Mandatory barriers should not be used to control SMP effects, since they impose unnecessary overhead on uniprocessor systems. Their main use is ordering accesses to memory-mapped I/O devices.
Write barriers for memory-mapped I/O areas (MMIO).
The following function is a write barrier for memory-mapped I/O:
mmiowb()
This function is a variant of the mandatory write barrier and provides partial ordering of writes to an MMIO region. Its effect may extend beyond the interface between the processor and the memory subsystem.
Effects of the processor cache.
How operations on cached memory are observed by the components of the system depends largely on the caches that lie between the processors and memory, and on the system that maintains their coherence.
From the point of view of memory access barriers, the processor caches and the coherence mechanism are part of system memory; the effect of barriers manifests itself at the boundary shown by the dotted line in the diagram:
    <--- CPU --->          :      <----------- Memory ----------->
                           :
 +--------+   +--------+   :   +--------+   +-----------+
 |        |   |        |   :   |        |   |           |   +--------+
 |  CPU   |   | Memory |   :   | CPU    |   |           |   |        |
 |  Core  |-->| Access |------>| Cache  |<->|           |   |        |
 |        |   | Queue  |   :   |        |   |           |-->| Memory |
 |        |   |        |   :   |        |   |           |   |        |
 +--------+   +--------+   :   +--------+   |           |   |        |
                           :                | Cache     |   +--------+
                           :                | Coherency |
                           :                | Mechanism |   +--------+
 +--------+   +--------+   :   +--------+   |           |   |        |
 |        |   |        |   :   |        |   |           |   |        |
 |  CPU   |   | Memory |   :   | CPU    |   |           |-->| Device |
 |  Core  |-->| Access |------>| Cache  |<->|           |   |        |
 |        |   | Queue  |   :   |        |   |           |   |        |
 |        |   |        |   :   |        |   |           |   |        |
 +--------+   +--------+   :   +--------+   +-----------+   +--------+
                           :
                           :
Although a memory access may be satisfied by the processor without touching the external bus (for example, from its cache), from the point of view of other processors the cache coherence mechanism still maintains the appearance that interprocessor interaction goes through external memory, by migrating the cache lines being accessed.
The processor core executes instructions in whatever order is convenient for it, while preserving the cause-and-effect relationships of the executed program. Some instructions generate read and write requests, which are placed in a queue to the memory subsystem. The core places them in the queue in the appropriate order and continues execution until it needs the result of one of these instructions.
Memory access barriers control the order in which requests cross from the processor to the memory side, and the order in which the effects of operations performed by other system components become observable to the processor.