
But how does multithreading work? Part II: memory ordering



The knowledge of thread management we gained in the previous article is great, of course, but plenty of questions remain. For example: “How does happens-before work?”, “Is it true that volatile flushes the caches?”, “Why was a memory model needed at all? Wasn’t everything fine before it came along?”

Like the previous article, this one follows the principle of “first briefly describe what should happen in theory, then go to the sources and see how it actually happens there.” The first part is therefore largely applicable not only to Java, so developers on other platforms may find something useful here too.


Theoretical minimum

Hardware performance does not keep growing for nothing. Engineers who design, say, processors come up with all kinds of optimizations that squeeze a few more abstract parrots out of your code. Performance is never free, however, and here the price is that the way your code actually executes can become counterintuitive. A great many hardware features are hidden from us by abstractions. To those who have not yet done so, I recommend watching the talk by Sergei Walrus Kuksenko called Quantum Performance Effects, which demonstrates perfectly how leaky your abstractions can turn out to be. We will not look far for an example and will take caches.

How a cache works


A request to “main memory” is an expensive operation; even on modern machines it can take hundreds of nanoseconds. In that time the processor could have executed a great many instructions. To avoid the indignity of sitting idle forever, caches were invented. Put simply, the processor keeps copies of frequently used main memory contents right next to itself. More complicated words about the different kinds of caches and their hierarchies can be read elsewhere; we are more interested in how the data in the cache is kept up to date. With a single processor (or core; the term processor is used below) there is obviously no problem, but with several cores (YAY MULTITHREADING!) questions begin to arise.
How can processor A know that processor B has changed some value if A has it cached?

Or, in other words, how to ensure the coherence of caches?

For different processors to have a consistent picture of the world, they must communicate with each other in some way. The rules they follow in this communication are called a cache coherence protocol.

Cache Coherence Protocols


There are many different protocols, and they not only vary from one hardware vendor to another but keep evolving even within a single vendor. Despite the vastness of the protocol world, however, most of them share some common ground. Sacrificing a little generality, we will consider the MESI protocol. There are, of course, approaches that differ from it radically, for example Directory Based ones, but those are not covered in this article.

In MESI, each cache cell (line) can be in one of four states:

Modified: the cell is present only in this cache and has been modified, so its contents must eventually be written back to main memory.
Exclusive: the cell is present only in this cache, but its contents match main memory.
Shared: the cell may also be present in the caches of other processors, and its contents match main memory.
Invalid: the cell does not hold valid data.


State transitions happen through an exchange of messages whose format is also part of the protocol. It is rather ironic, by the way, that at such a low level the changes of state are driven precisely by message passing. Got a problem, Actor Model haters?
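To make those states and transitions a little more tangible, here is a minimal Java sketch of a single cache line's state machine. This is a toy model of my own, not code from any real protocol implementation; real protocols track ownership, pending messages and much more.

 // Toy model of MESI state transitions for one cache line, for illustration only.
 enum CacheLineState {
     MODIFIED, EXCLUSIVE, SHARED, INVALID;

     // Our own processor reads the line (on a miss, other caches are probed).
     CacheLineState onLocalRead(boolean otherCachesHoldCopy) {
         if (this == INVALID) {
             return otherCachesHoldCopy ? SHARED : EXCLUSIVE;
         }
         return this; // M, E and S can satisfy the read without changing state
     }

     // Our own processor writes the line; from S or I an Invalidate must be broadcast first.
     CacheLineState onLocalWrite() {
         return MODIFIED;
     }

     // Another processor asked to read this line (snooped read): a Modified line is written back.
     CacheLineState onRemoteRead() {
         return (this == INVALID) ? INVALID : SHARED;
     }

     // Another processor wants to write: we received an Invalidate message.
     CacheLineState onRemoteInvalidate() {
         return INVALID;
     }
 }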

To keep the article short and encourage independent study, I will not describe the message exchange in detail. Those interested can find this information, for example, in the wonderful article Memory Barriers: a Hardware View for Software Hackers. Traditionally, deeper reflections on the topic can be found in cheremin's blog.

Cleverly skipping the description of the messages themselves, we will make two remarks about them. First, messages are not delivered instantly, so state changes come with latency. Second, some messages require special handling, which stalls the processor. Together this leads to various scalability and performance problems.

Optimizations for MESI and the problems they cause


Store buffers


To write something into a cell that is in the Shared state, you have to send an Invalidate message and wait until everyone acknowledges it. The processor would be idle all that time, which is incredibly sad, because the time a message takes to arrive is usually several orders of magnitude longer than a simple instruction needs to execute. To avoid such a senseless and merciless waste of CPU time, Store Buffers were invented. The processor places the values it wants to write into this buffer and keeps executing instructions. When the necessary Invalidate Acknowledge messages finally arrive, the data is at last written out.

Of course, there are a few hidden rakes here. The first one is quite obvious: if the same processor tries to read the value before it has left the buffer, it would get something completely different from what it has just written. This is solved with Store Forwarding: the store buffer is always checked for the requested value first, and if it is there, the value is taken from the buffer.
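As a toy illustration of both ideas, here is a minimal Java sketch of a store buffer with store forwarding. It is my own simplified model, nothing like the actual hardware structures: flush() stands in for the moment when the Invalidate Acknowledge messages finally arrive.

 import java.util.LinkedHashMap;
 import java.util.Map;

 // Toy model: the "processor" first looks in its own store buffer, only then in "memory".
 class ToyStoreBuffer {
     private final Map<Long, Long> pending = new LinkedHashMap<>(); // stores not yet visible to others
     private final Map<Long, Long> memory;                          // "main memory"

     ToyStoreBuffer(Map<Long, Long> memory) {
         this.memory = memory;
     }

     void store(long address, long value) {
         pending.put(address, value);        // the write goes to the buffer, the CPU moves on
     }

     long load(long address) {
         Long buffered = pending.get(address);
         if (buffered != null) {
             return buffered;                // store forwarding: our own pending write wins
         }
         return memory.getOrDefault(address, 0L);
     }

     void flush() {
         memory.putAll(pending);             // the acknowledged stores finally reach memory
         pending.clear();
     }
 }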

But the second rake is much more interesting. Nothing guarantees that if stores entered the store buffer in a given order, they will reach memory in that same order. Consider the following piece of pseudocode:

 void executedOnCpu0() {
     value = 10;
     finished = true;
 }

 void executedOnCpu1() {
     while(!finished);
     assert value == 10;
 }

What could possibly go wrong here? Quite a lot, contrary to what you might think. For example, if at the moment the code runs the finished cell is in the Exclusive state on Cpu0 while value is, say, Invalid, then value will leave the store buffer later than finished. It is entirely possible that Cpu1 will read finished as true while value is still not equal to 10. This phenomenon is called reordering. And of course reordering happens not only in this case: the compiler, for its own reasons, may also swap some instructions around.

Invalidate Queues


As you can easily guess, store buffers are not infinite and therefore tend to overflow, so you often still end up waiting for Invalidate Acknowledge. And those acknowledgments can sometimes take a very long time if the processor and its cache are busy. The problem can be solved by introducing yet another entity: the Invalidate Queue. All requests to invalidate cells are placed into this queue, and the acknowledgment is sent back immediately; the actual invalidation is performed whenever it is convenient for the processor. In return the processor promises to behave: it will not send any messages concerning that cell until the invalidation has been processed. Feel the catch? Let's return to our code.

 void executedOnCpu0() {
     value = 10;
     finished = true;
 }

 void executedOnCpu1() {
     while(!finished);
     assert value == 10;
 }


Suppose we got lucky (or used some secret knowledge) and Cpu0 wrote the cells out in exactly the order we need. Does that guarantee they will arrive in Cpu1's cache in the same order? As you may have guessed, no. We will also assume that the value cell is present in Cpu1's cache (in the Shared state), while finished is held only by Cpu0, in the Exclusive state. The sequence of events may then look like this:

Notation: "← msg" means a message is sent, "→ msg" means a previously received message is processed, "- action" is an internal action.

Step 0, initial state:
    Cpu0: value = 0 (Shared), finished = false (Exclusive)
    Cpu1: value = 0 (Shared), finished = (Invalid)

Step 1, Cpu0: value = 10;  - store_buffer(value)  ← invalidate(value)
    Cpu0: value = 0 (Shared, 10 in store buffer), finished = false (Exclusive)

Step 2, Cpu1: while (!finished);  ← read(finished)
    Cpu1: value = 0 (Shared), finished = (Invalid)

Step 3, Cpu0: finished = true;
    Cpu0: value = 0 (Shared, 10 in store buffer), finished = true (Modified)

Step 4, Cpu1: → invalidate(value)  ← invalidate_ack(value)  - invalidate_queue(value)
    Cpu1: value = 0 (Shared, invalidate in queue), finished = (Invalid)

Step 5, Cpu0: → read(finished)  ← read_resp(finished)
    Cpu0: value = 0 (Shared, 10 in store buffer), finished = true (Shared)

Step 6, Cpu1: → read_resp(finished)
    Cpu1: value = 0 (Shared, invalidate in queue), finished = true (Shared)

Step 7, Cpu1: assert value == 10;  (Assertion fails)
    Cpu1: value = 0 (Shared, invalidate in queue), finished = true (Shared)

Step N, Cpu1: - invalidate(value)
    Cpu1: value = (Invalid), finished = true (Shared)


Multithreading is simple and straightforward, isn't it? The problem is in steps (4) through (6). Receiving invalidate in (4), we do not apply it but put it into the queue. In step (6) we receive the read_response to the read request that was sent back in (2), yet this does not force us to invalidate value, so the assertion fails. If operation (N) had happened earlier we would still have had a chance, but now this damned optimization has broken everything! On the other hand, it is so fast and gives us ultra-low latency™! That's the dilemma. Hardware designers cannot magically know in advance when an optimization is allowed and when it may break something, so they pass the problem on to us, adding: “It's dangerous to go alone. Take this!”

Hardware Memory Model


The magic sword handed out by the designers who sent us off to fight dragons is actually not a sword but rather the Rules of the Game. They describe what values a processor can observe, given the actions performed by it and by other processors. What looks much more like a sword is the Memory Barrier. In our MESI example there are these swords:

Store Memory Barrier (also ST, SMB, smp_wmb): an instruction that forces the processor to apply all stores already in the store buffer before applying any stores that follow the instruction.

Load Memory Barrier (also LD, RMB, smp_rmb): an instruction that forces the processor to apply all invalidations already in the queue before executing any subsequent load instructions.


Having a new weapon at our disposal, we can easily repair our example:

 void executedOnCpu0() {
     value = 10;
     storeMemoryBarrier();
     finished = true;
 }

 void executedOnCpu1() {
     while(!finished);
     loadMemoryBarrier();
     assert value == 10;
 }


Great, everything works, we are satisfied! Now we can go and write fast, correct multi-threaded code. Although, wait...

But what does Java have to do with it?


Write Once @ Run Anywhere


All these cache coherence protocols, memory barriers, flushed caches and other platform-specific things should not, in theory, concern those who write Java code. Java is platform-independent, right? And indeed, there is no concept of reordering in the Java Memory Model.
NB: If this phrase confuses you, do not read on until you understand why. And read, for example, this.
Now this sounds interesting. There is no concept of “reordering”, yet reordering itself exists. The authorities are clearly hiding something! But even if we abandon the conspiracy view of reality, we are still left with curiosity and a desire to know. Let's satisfy it! Take a simple class that illustrates our recent example: [ github ]

 public class TestSubject {

     private volatile boolean finished;
     private int value = 0;

     void executedOnCpu0() {
         value = 10;
         finished = true;
     }

     void executedOnCpu1() {
         while(!finished);
         assert value == 10;
     }
 }


Traditionally, there are several ways to find out what is really going on. You can have fun with PrintAssembly, you can look at what the interpreter does, you can extract the secret knowledge from those who already possess it. Or you can declare with a mysterious look that the caches get flushed there and leave it at that.

Last time we looked at the C++ interpreter, which is not really used in production. This time we will look at how the client compiler (C1) works. I used openjdk-7u40-fcs-src-b43-26_aug_2013 for my purposes.

For someone who has never opened the OpenJDK sources before (and for someone who has, too), it can be a challenge to find where the actions of interest are performed. One of the easiest ways is to look at the bytecode, find the name of the instruction you need, and then search for it.

 $ javac TestSubject.java && javap -c TestSubject

 void executedOnCpu0();
   Code:
      0: aload_0          // push this onto the stack
      1: bipush 10        // push the constant 10
      3: putfield #2      // store 10 into the field value of this
      6: aload_0          // push this onto the stack
      7: iconst_1         // push the constant 1
      8: putfield #3      // store 1 (true) into the field finished of this
     11: return

 void executedOnCpu1();
   Code:
      0: aload_0          // push this onto the stack
      1: getfield #3      // read the field finished of this
      4: ifne 10          // if it is non-zero, jump to 10 (leave the loop)
      7: goto 0           // otherwise repeat the loop
     10: getstatic #4     // read the static field $assertionsDisabled:Z
     13: ifne 33          // if assertions are disabled, jump to 33 (return)
     16: aload_0          // push this onto the stack
     17: getfield #2      // read the field value of this
     20: bipush 10        // push the constant 10
     22: if_icmpeq 33     // if the two are equal, jump to 33 (return)
     25: new #5           // create a java/lang/AssertionError
     28: dup              // duplicate the reference on the stack
     29: invokespecial #6 // call the constructor (<init>)
     32: athrow           // throw the exception
     33: return

NB: You should not try to deduce the exact runtime behavior of a program from its bytecode. Once the JIT compiler has done its job, things can change a great deal.
What interesting things can we notice here? The first small thing many people forget: assertions are disabled by default; you can enable them with the -ea flag. But that is a trifle. What we actually came here for are the names getfield and putfield. Are you thinking what I'm thinking? (Of course, Gleb! But how do we build a Dyson Sphere out of bacon, a plunger, and two bras?!)

Down the Rabbit Hole


Noting that the same instructions are used for both fields, let's find out where the information that a field is volatile is kept. Field data is stored by the class in share/vm/ci/ciField.hpp. We are interested in the method (line 176):

 bool is_volatile () { return flags().is_volatile(); }
To find out what C1 does with accesses to volatile fields, we can look up all the usages of this method. After a little wandering around the dungeons and collecting a few Scrolls of Ancient Knowledge, we find ourselves in the file share/vm/c1/c1_LIRGenerator.cpp. As its name suggests, it generates the low-level intermediate representation (LIR) of our code.

C1 Intermediate Representation, using putfield as an example


When C1 builds its IR, our putfield instruction eventually ends up being processed here. Looking at the special handling performed for volatile fields, we quickly run into familiar words (lines 1734-1736):

 if (is_volatile && os::is_MP()) {
     __ membar_release();
 }
Here __ is a macro that expands to gen()->lir()->, and the membar_release method is defined in share/vm/c1/c1_LIR.hpp (line 1958):

 void membar_release() { append(new LIR_Op0(lir_membar_release)); }
So this line adds a membar_release instruction to the intermediate representation of our code. A little further on, the following happens (lines 1747-1749):

 if (is_volatile && !needs_patching) {
     volatile_field_store(value.result(), address, info);
 }
The implementation of volatile_field_store is already platform-dependent. On x86 (cpu/x86/vm/c1_LIRGenerator_x86.cpp), for example, the logic is quite simple: it checks whether the field is 64 bits wide, and if so, Black Magic is used to make the write atomic. Remember that, in the absence of volatile, long and double fields may be written non-atomically?
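As a side note on that atomicity point: the JLS (17.7) allows a non-volatile long or double write to be split into two 32-bit halves. Below is a hypothetical sketch of where that could show up; on most modern 64-bit JVMs you will probably never observe a torn value, but the specification permits it.

 // Sketch: without volatile, a torn write to a long is legal (two 32-bit halves).
 // The values are chosen so that mixing halves yields an obviously wrong number.
 class WordTearingDemo {
     static long plainValue;            // may legally be written non-atomically
     static volatile long safeValue;    // volatile long/double writes are always atomic

     static void writer() {
         for (long i = 0; i < 1_000_000; i++) {
             plainValue = (i % 2 == 0) ? 0L : 0xFFFF_FFFF_FFFF_FFFFL;
             safeValue  = plainValue;
         }
     }

     static void reader() {
         for (long i = 0; i < 1_000_000; i++) {
             long seen = plainValue;
             // A value like 0x0000_0000_FFFF_FFFF here would indicate a torn read.
             if (seen != 0L && seen != -1L) {
                 System.out.println("Torn value observed: " + Long.toHexString(seen));
             }
         }
     }
 }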

And finally, at the very end, one more membar is placed, this time without release (lines 1759-1761):

 if (is_volatile && os::is_MP()) {
     __ membar();
 }

where membar is defined just above membar_release (c1_LIR.hpp, line 1956):

 void membar() { append(new LIR_Op0(lir_membar)); }
NB: Of course, I have treacherously concealed some of what is going on, for example the manipulations related to GC. Examining them is left to the reader as an exercise.


Converting the IR to assembly


Earlier we had only ST and LD, and now new kinds of barriers appear. The thing is, what we saw before were barriers for our low-level MESI example; we have since moved to a higher level of abstraction, and the terminology has changed a bit. Suppose there are two kinds of memory operations, Store and Load. Then there are four ordered combinations of two operations: Load-Load, Load-Store, Store-Load and Store-Store. Two of these categories, StoreStore and LoadLoad, correspond to the barriers we saw when talking about MESI. The other two are also easy to digest: a LoadStore barrier requires all loads issued before it to complete before any store that follows it, and a StoreLoad barrier, conversely, requires all stores before it to complete before any subsequent load. More details can be found, for example, in the JSR-133 Cookbook.
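These four categories can nowadays even be touched directly from Java code via the static fence methods of VarHandle, which appeared in Java 9, long after this article was written. A rough cheat sheet:

 import java.lang.invoke.VarHandle;

 class FenceCheatSheet {
     static void examples() {
         VarHandle.loadLoadFence();   // LoadLoad:   earlier loads may not reorder with later loads
         VarHandle.storeStoreFence(); // StoreStore: earlier stores may not reorder with later stores
         VarHandle.acquireFence();    // LoadLoad + LoadStore
         VarHandle.releaseFence();    // LoadStore + StoreStore
         VarHandle.fullFence();       // everything, including the expensive StoreLoad
     }
 }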

In addition, operations with Acquire semantics and operations with Release semantics are distinguished. Release applies to writes and guarantees that all memory operations preceding the write must complete before the write itself; in other words, a write-release cannot be reordered with any memory operation that precedes it in program order. Such semantics can be provided by the LoadStore + StoreStore barrier combination. Acquire, as you might guess, is the mirror image: it applies to reads and can be expressed as LoadStore + LoadLoad.
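For illustration, our running example can be written with explicit release/acquire accesses instead of volatile, again using the Java 9 VarHandle API. This is a sketch: for this particular pattern release/acquire ordering is enough, although volatile gives the stronger sequentially consistent guarantees.

 import java.lang.invoke.MethodHandles;
 import java.lang.invoke.VarHandle;

 class AcquireReleaseExample {
     static final VarHandle FINISHED;
     static {
         try {
             FINISHED = MethodHandles.lookup()
                     .findVarHandle(AcquireReleaseExample.class, "finished", boolean.class);
         } catch (ReflectiveOperationException e) {
             throw new ExceptionInInitializerError(e);
         }
     }

     int value;
     boolean finished;

     void executedOnCpu0() {
         value = 10;
         FINISHED.setRelease(this, true);   // write-release: "value = 10" cannot sink below it
     }

     void executedOnCpu1() {
         while (!((boolean) FINISHED.getAcquire(this))) {
             // read-acquire: the assert below cannot float above this read
         }
         assert value == 10;
     }
 }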

Now we understand what barriers the JVM places. So far, however, we have only seen them in the LIR, which, Low-level though it is, is still not the native code that the JIT has to produce for us. Exploring exactly how C1 turns LIR into native code is beyond the scope of this article, so without further ado we go straight to the file share/vm/c1/c1_LIRAssembler.cpp. That is where the IR is converted into assembly. For example, lir_membar_release is handled in a very sinister-looking place (lines 665-667):

 case lir_membar_release: membar_release(); break;
The method being called is platform-dependent; the x86 source lives in cpu/x86/vm/c1_LIRAssembler_x86.cpp (lines 3733-3736):

 void LIR_Assembler::membar_release() {
   // No x86 machines currently require store fences
   // __ store_fence();
 }
Gorgeous! Thanks to the strict memory model of this architecture (including TSO, Total Store Order), every write already has release semantics. The second membar is a bit more complicated (lines 3723-3726):

 void LIR_Assembler::membar() {
   // QQQ sparc TSO uses this,
   __ membar( Assembler::Membar_mask_bits(Assembler::StoreLoad));
 }

Here the macro __ expands to _masm->, and the membar method lives in cpu/x86/vm/assembler_x86.hpp and looks like this (lines 1247-1263):

 void membar(Membar_mask_bits order_constraint) {
   if (os::is_MP()) {
     // We only have to handle StoreLoad
     if (order_constraint & StoreLoad) {
       // All usable chips support "locked" instructions which suffice
       // as barriers, and are much faster than the alternative of
       // using cpuid instruction. We use here a locked add [esp],0.
       // This is conveniently otherwise a no-op except for blowing
       // flags.
       // Any change to this code may need to revisit other places in
       // the code where this idiom is used, in particular the
       // orderAccess code.
       lock();
       addl(Address(rsp, 0), 0);// Assert the lock# signal here
     }
   }
 }
It turns out that on x86 every write to a volatile variable is followed by the expensive StoreLoad barrier in the form of lock addl $0x0,(%rsp). The operation is expensive because it forces the processor to drain the entire store buffer. In return it gives us exactly the effect we expect from volatile: all other threads will see a value at least as fresh as the one current at the moment the barrier executed.
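If you want to feel this cost yourself, a benchmark sketch along these lines compares a plain store with a volatile one that drags the lock addl behind it. This is a hypothetical JMH harness of my own; the numbers will vary wildly between machines, the point is only the relative difference.

 import org.openjdk.jmh.annotations.Benchmark;
 import org.openjdk.jmh.annotations.Scope;
 import org.openjdk.jmh.annotations.State;

 @State(Scope.Thread)
 public class StoreCostBench {
     int plainField;
     volatile int volatileField;

     @Benchmark
     public void plainStore() {
         plainField = 42;          // usually just a mov
     }

     @Benchmark
     public void volatileStore() {
         volatileField = 42;       // mov + lock addl $0x0,(%rsp) on x86
     }
 }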

It also turns out that a volatile read on x86 should be a perfectly ordinary read. A quick look at the method LIRGenerator::do_LoadField tells us that after the read, as we expected, a membar_acquire is placed, which on x86 looks like this (lines 3728-3731):

 void LIR_Assembler::membar_acquire() {
   // No x86 machines currently require load fences
   // __ load_fence();
 }
Of course, this does not mean that a volatile read carries no overhead compared to an ordinary one. Even though nothing is added to the native code, the mere presence of a barrier in the IR forbids the compiler from rearranging certain instructions (otherwise you can catch some funny bugs). There are plenty of other effects of using volatile; you can read about them, for example, in this article.

A sanity check


PrintAssembly


Sitting and divining over source code is a noble pastime worthy of any self-respecting philosopher. Still, just in case, let's also look at PrintAssembly. To do that, we add to our test subject a loop with many calls to the methods of interest, disable inlining (to make the generated code easier to navigate), and run it in the client VM, not forgetting to enable assertions:

 $ java -client -ea -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:MaxInlineSize=0 TestSubject
 ...
   # {method} 'executedOnCpu0' '()V' in 'TestSubject'
   ...
   0x00007f6d1d07405c: movl   $0xa,0xc(%rsi)
   0x00007f6d1d074063: movb   $0x1,0x10(%rsi)
   0x00007f6d1d074067: lock addl $0x0,(%rsp)    ;*putfield finished
                                                ; - TestSubject::executedOnCpu0@8 (line 15)
   ...
   # {method} 'executedOnCpu1' '()V' in 'TestSubject'
   ...
   0x00007f6d1d061126: movzbl 0x10(%rbx),%r11d  ;*getfield finished
                                                ; - TestSubject::executedOnCpu1@1 (line 19)
   0x00007f6d1d06112b: test   %r11d,%r11d
   ...

That's nice, everything looks exactly as we predicted. It remains to check whether something can really go wrong in the absence of volatile. Earlier, in his article, TheShade showed broken Double-Checked Locking, but we also want to misbehave a little ourselves, so we will try to break everything on our own. Well, almost on our own.

Demonstration of breakage without volatile


The problem with demonstrating this kind of reordering is that, in general, the probability of it occurring is quite low, and on some architectures the hardware memory model will not allow it at all. So we either need to get hold of an Alpha or rely on compiler reordering, and in any case run everything many, many, many times. Fortunately, we don't have to reinvent the wheel: let's use the wonderful jcstress utility. Put very simply, it repeatedly executes some code and collects statistics on the outcomes, doing all the dirty work for us, including the work many people don't even suspect is needed.

Moreover, the test we need has already been written for us. Strictly speaking, it is a little more elaborate than ours, but it shows exactly what is going on:

 static class State {
     int x;
     int y; // acq/rel var
 }

 @Override
 public void actor1(State s, IntResult2 r) {
     s.x = 1;
     s.x = 2;
     s.y = 1;
     s.x = 3;
 }

 @Override
 public void actor2(State s, IntResult2 r) {
     r.r1 = s.y;
     r.r2 = s.x;
 }


We have two threads: one changes the state, the other reads the state and records what it saw. The framework aggregates the results for us and checks them against certain rules. We are interested in two outcomes the second thread may observe: [1, 0] and [1, 1]. In both cases we have read y == 1, but either we did not see any of the writes to x at all (x == 0), or we saw one that was not the latest at the moment y was written, that is x == 1. According to our theory, such outcomes should occur. Let's check:

 $ java -jar tests-all/target/jcstress.jar -v -t ".*UnfencedAcquireReleaseTest.*"
 ...
 Observed state   Occurrence   Expectation              Interpretation
 [0, 0]           32725135     ACCEPTABLE               Before observing releasing write to $y, any value is OK for $x.
 [0, 1]           15           ACCEPTABLE               Before observing releasing write to $y, any value is OK for $x.
 [0, 2]           36           ACCEPTABLE               Before observing releasing write to $y, any value is OK for $x.
 [0, 3]           10902        ACCEPTABLE               Before observing releasing write to $y, any value is OK for $x.
 [1, 0]           65960        ACCEPTABLE_INTERESTING   Can read the default or old value for $x after $y is observed.
 [1, 3]           50929785     ACCEPTABLE               Can see a released value of $x if $y is observed.
 [1, 2]           7            ACCEPTABLE               Can see a released value of $x if $y is observed.


Here we can see that in 65960 cases out of 83731840 (roughly 0.08%) we observed y == 1 && x == 0, which clearly indicates that a reordering occurred. Hooray, we can call it a day.

The reader should by now have a good enough understanding of what is going on to answer the questions asked at the beginning of the article. As a reminder:

“How does happens-before work?”
“Is it true that volatile flushes the caches?”
“Why was a memory model needed at all?”
Well, has everything fallen into place? If not, try digging into the relevant section of the article once more. If that does not help, welcome to the comments!

And one more thing ©

Transformations of the source code can be performed not only by the hardware but by the whole execution environment. To satisfy the requirements of the JMM, restrictions are imposed on every component where something can change. For example, the compiler is generally free to reorder certain instructions, but the JMM forbids it many optimizations.
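A classic illustration of a purely compiler-level effect: in the sketch below the JIT is allowed to hoist the read of a non-volatile running flag out of the loop and spin forever; with volatile that transformation is forbidden. Whether a particular JVM actually performs the hoisting is not guaranteed, it is merely permitted.

 // Without volatile the JIT may turn the loop into "if (!running) while (true);",
 // because nothing forbids it from caching the value of a plain field.
 class HoistingDemo {
     static /* volatile */ boolean running = true;

     public static void main(String[] args) throws InterruptedException {
         Thread worker = new Thread(() -> {
             while (running) {
                 // spin; with the field left non-volatile this loop may never notice the update
             }
             System.out.println("Stopped");
         });
         worker.start();

         Thread.sleep(1000);
         running = false;       // with volatile, the worker is guaranteed to eventually see this
         worker.join();
     }
 }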

Of course, the server compiler (C2) is considerably smarter than the C1 we have examined, and some things in it are done very differently. For example, its handling of memory semantics is completely different.

In the guts of OpenJDK's multithreading machinery the check os::is_MP() is used in many places; it improves performance on single-processor machines by skipping certain operations. If you use the Forbidden Arts to make the JVM believe at startup that it is running on a single processor, it will not live long.

Many thanks to the valiant TheShade, cheremin and artyushov for reading the article before publication and making sure that, instead of light, I did not bring the masses some nonsense stuffed with flat jokes and typos.

Source: https://habr.com/ru/post/209128/

