But how does multithreading work? Part I: Synchronization

(post from the series “I have cloned my hotspot source code, let's look at them together”)
Everyone who faces multithreaded problems (be it performance or incomprehensible heisenbags), inevitably encounters in the process of solving problems like inflation, contention, membar, biased locking, thread parking and the like. But does everyone really know what is behind these terms? Unfortunately, as practice shows, not all .

Hoping to correct the situation, I decided to write a series of articles on this topic. Each of them will be built on the principle of "first briefly describe what should happen in theory, and then go to the source and see how it happens there . " Thus, the first part is largely applicable not only to Java, and therefore developers for other platforms may find something useful for themselves.

Before reading in-depth descriptions, it is useful to make sure that you are sufficiently familiar with the Java Memory Model. You can study it, for example, on the slides of Sergei Walrus Kuksenko or on my early topic . This presentation is also a great material, starting with slide # 38.

Theoretical minimum

As you know, every object in java has its own monitor , and therefore, unlike the same C ++, there is no need to guard access to the objects by separate mutexes. To achieve the effects of mutual exclusion and synchronization of threads use the following operations:

monitorenter : capture monitor. Only one stream can own a monitor at a time. If the monitor is busy at the time of the capture attempt, the stream attempting to capture it will wait until it is free. At the same time, there may be several threads in the queue.
monitorexit : monitor release
wait : move the current thread to the so-called monitor wait set and wait for notify occur. Exiting the wait method may also be false. After the thread owning the monitor has made a wait , any other stream can take over the monitor.
notify (all) : one (or all) threads are awakened, which are now in the monitor's wait set. To get control, the awakened stream must successfully capture the monitor (monitorenter)

')
Before continuing, we define an important concept:

contention - a situation when several entities simultaneously try to own the same resource that is intended for exclusive use

How contention to owning a monitor really depends on how it is captured. The monitor may be in the following states:

init : the monitor has just been created, and so far no one has been captured
biased : (smart optimization, which did not appear immediately) The monitor is “reserved” for the first stream that captured it. In the future, to capture this stream does not need expensive operations, and the capture occurs very quickly. When a capture tries to produce another stream, either the monitor is re-backed up for it ( rebias ), or the monitor changes to the thin ( revoke bias ) state. There are also additional optimizations that act immediately on all instances of the class of an object whose monitor is trying to capture ( bulk revoke / rebias )
thin : the monitor is trying to capture multiple streams, but there is no contention (i.e., they capture it not simultaneously, or with very little overlap) . In this case, the capture is performed using a relatively cheap CAS . If contention occurs, the monitor changes to the inflated state.
fat / inflated : synchronization is performed at the operating system level. The stream parks and sleeps until it is its turn to capture the monitor. Even if we forget about the cost of changing the context, then when the flow receives control, it also depends on the system scheduler, and therefore the time can pass significantly more than we would like. When contention disappears, the monitor may return to the thin state.

This is where abstract reasoning ends, and we dive into how it is implemented in hotspot.

Object Headers

Inside a virtual machine, object headers generally contain two fields: the mark word and a pointer to the class of the object. In particular cases, there may be something added: for example, the length of the array. These headers are stored in the so-called oop - ohs (Ordinary Object Pointer), and you can look at their structure in the file hotspot/src/share/vm/oops/oop.hpp . We will study in more detail what the mark word is, which is described in the markOop.hpp file located in the same folder. (Do not pay attention to the inheritance from oopDesc : it is only for historical reasons) In an amicable way, it should be carefully read, paying attention to detailed comments, but for those who are not very interested, below is a brief description of what is in this mark Word contains and in what cases. You can still see this presentation here from the 90th slide.

Content mark words

condition	Tag	Content
unlocked thin	01	`Identity hashcode`		`age`	`0`
locked, thin, unbiased	00	`mark word`
inflated	ten
biased	01	`id -`	`epoch`	`age`	`1`
marked for GC	eleven

Here we see several new values. First, identity hash code is the object hash code that is returned when System.identityHashCode called. Secondly, age - how many garbage assemblies survived the object. And there is also epoch , which indicates the number of bulk revocation or bulk rebias for the class of this object. To why this is necessary, we will come later.

Have you noticed that in the case of biased there was not enough space at the same time for identity hash code and threadID + epoch? And this is true, and from here there is an interesting consequence: in hotspot, a call to System.identityHashCode will result in a revoke bias object.

Further, when the monitor is busy, a mark is stored in the mark word to the place where the real mark word is stored . In the stack of each thread there are several "sections" in which different things are stored. We are interested in the one where the lock record is stored. This is where we copy the mark word of the object for lightweight locking. Because, by the way, thin-locked objects are called stack locked . A swollen monitor can be stored both on the thread that inflated it and in the global pool of thick monitors.

It's time to go to the code.

A simple example of using `synchronized`

Let's start with this class:

 1 2 3 4 5 6 7

 public class SynchronizedSample { void doSomething() { synchronized(this) { // Do something } } }

and see what it compiles into:

 javac SynchronizedSample.java && javap -c SynchronizedSample

I will not give a full listing, but I will only manage with the body of the doSomething method, providing it with comments.

 void doSomething(); Code: 0: aload_0 //    this 1: dup //    (this) 2: astore_1 //      (this)   1 3: monitorenter //   ,     (this) 4: aload_1 //   this   5: monitorexit //   6: goto 14 //  "catch-" 9: astore_2 // (  ,   )     2 10: aload_1 //  this    11: monitorexit //   12: aload_2 //      13: athrow //   14: return //  Exception table: from to target type 4 6 9 any //      ,   "catch-" 9 12 9 any //   ,

Here we are interested in the instructions monitorenter and monitorexit . You can, of course, search for what they do in the ~~Yandex~~ search system at your discretion, but this is fraught with misinformation, and not in a patzanski way. In addition, I have the OpenJDK sources at hand, which you can have fun with at all . In these source codes it is quite easy to see what happens to the byte code in the interpretation mode. There is only one caveat: Lesha TheShade Shipilev said that

In general, the VM helper code for some action may differ in content from that pasted by JIT. Up to the point that some optimizations by JIT can simply not be ported to the interpreter.

Lesha also recommended PrintAssembly in the teeth and look at the compiled and JIT-compiled code right away, but I decided to start from a path of less resistance, and then see how it really is.

monitorenter

The interpreter sources are in the hotspot/src/share/vm/interpreter daddy, and there are a lot of them. Rereading everything at this stage is not very appropriate, therefore with the help of grep we will find places in which, probably, what we need happens. First of all, it’s worth looking at the ads in bytecodes.hpp and bytecodes.cpp :

 ./bytecodes.hpp:235: _monitorenter = 194, // 0xc2 ./bytecodes.cpp:489: def(_monitorenter, "monitorenter", "b", NULL, T_VOID, -1, true);

As you can easily guess, the human enum constant for bytecode .hpp is defined in .hpp , and this operation is registered in the .cpp using the def method. Talking about it separately in this article makes no sense: it is enough to clarify that the monitorenter command is monitorenter , which is a single byte-code without parameters ( b ), returns nothing, pulls one value from the stack and can provoke a lock or a safepoint call ( about the latter later).

The following is of interest file bytecodeInterpreter.cpp . It has a wonderful method BytecodeInterpreter::run(interpreterState istate) , which takes only about 2,200 lines, and generally rotates in a loop until the body of the method being processed ends. (In fact, another large piece deals with other useful things such as method initialization, locking, if the method is synchronized , and so on). Finally, starting at line 1667 , what happens when a monitorenter operation is monitorenter . First of all, there is a free monitor on the stream stack (if there are none, then it is requested from the interpreter using istate->set_msg(more_monitors) ), and an unlocked copy of the mark word is placed there. After that, using CAS, we try to write a pointer to this copy in the mark word of the object, which is called the displaced header .

CAS - Compare-and-swap - atomically compares *dest and compare_value , and if they are equal, *dest and exchange_value places. The initial value *dest returned. (This is guaranteed bilateral membar, but about them in the next article)

If CAS is a success, then the victory (and with it the monitor) is ours, and this can be completed (the tag is contained in the pointer to the displaced header itself - optimization). If not, then go ahead, but first pay attention to an important point: we didn’t check if this monitor wasn’t biased . Remembering the warnings of Leshina, we understand that we encountered optimization, which did not reach the interpreter. By the way, when processing synchronized methods everything is checked normally, but it will be a little later.

If CAS did not succeed, then we check whether we are already owners of the monitor (recursive capture); and if so, then success is with us again, the only thing we are doing is writing to the displaced header on our NULL stack (we will see later why this is necessary). Otherwise, we make the following call:

 CALL_VM(InterpreterRuntime::monitorenter(THREAD, entry), handle_exception);

The CALL_VM macro CALL_VM all sorts of technical operations such as creating frames and performs the transferred function. In our case, this function is InterpreterRuntime::monitorenter , which is located in the new interpreterRuntime.cpp file. The monitorenter method, depending on whether UseBiasedLocking is UseBiasedLocking , calls either ObjectSynchronizer::fast_enter or ObjectSynchronizer::slow_enter . Let's start with the first.

fast_enter

First a couple of words about the feasibility of such optimization. Some Tokyo scientists from IBM Research Labs calculated statistics by some unknown method and found that in most cases synchronization is uncontended. Moreover, an even stronger statement was proposed: the monitors of most objects throughout their life are captured by only one stream. And so the idea of biased locking was born: the one who stood up first, the one and the slippers. You can read more about biased locking, for example, on these slides (although they are slightly out of date), and we will return to the hotspot sources. Now we are interested in the src/share/vm/runtime/synchronizer.cpp file, starting at line 169. First we should try to do a rebias for ourselves, and if it does not work out - do a revoke and go to the usual thin slow_enter. Optimistic attempts occur in the BiasedLocking::revoke_and_rebias , which is in the biasedLocking.cpp file. We describe them in more detail:

Rebias is a flow change that reserves the right to quickly capture a monitor. It is permissible in simple cases: for example, when the monitor is actually still someone did not have time to capture ( anonymously biased ), or if bias has been preserved since the previous era (the number of bulk revocation for a particular class, see below), which is essentially similar the fact that the monitor is not over-bias-il. In addition, rebiasing can be banned completely by calling fast_enter ( attempt_rebias = false ).
Revoke - disable biased locking for the monitor. Used in most cases, especially when attempt_rebias set to false . If the monitor is now owned by another live stream, then rebias will require this stream to stop at safepoint , after which it will run over the stack of this stream and fix the monitor header stored there on unbiased.
Bulk rebias - rebias of all living and later allocated instances of this class. It is executed when the number of revocations or rebiasing operations for a given class during this epoch exceeds the BiasedLockingBulkRebiasThreshold . Provokes a change of epoch, runs at a global safepoint.
Bulk revoke - revoke all live and later instances of this class. It is executed when the number of revocations or rebiasing for this class during this era exceeds the BiasedLockingBulkRevokeThreshold . Provokes a change of epoch, runs at a global safepoint.

Let me remind those who knew, but forgot ^tm , what is a safepoint:

safepoint is a state of the virtual machine in which the execution of threads is stopped in safe places. This allows for intrusive operations, such as revoke bias at the monitor, which the stream currently owns, deoptimizing or taking the thread dump.

In our case, the attempt_rebias parameter is always true , but sometimes it may turn out to be false : for example, in the case when the call comes from VM Thread.

As you can guess, bulk operations are tricky optimizations that make it easy to transfer a large number of objects between threads. If it were not for this optimization, it would be dangerous to include UseBiasedLocking by default, since then a large class of applications would always be engaged in revocations and rebiasing.

If the fast path to capture the stream failed (that is, revoke bias was made), we proceed to capture the thin-lock.

slow_enter

The method that interests us now is in the src/share/vm/runtime/synchronizer.cpp file. Here we have several options for the development of events.

Case one, “good” : the object monitor is currently free. Then we try to take it with the help of CAS. CAS is already architecture-dependent (and, as a rule, natively supported by the processor) instruction, so there is no special meaning to go deeper. (Although Ruslan # let me know the cheremin and was outraged that I "did not get to the running of electrons in the sources-drains of MOS transistors") . If CAS succeeds, then victory: the monitor is captured. If it fails, we are filled with sadness and proceed to the third case.
The second case is also “good” : the object monitor is not free, but is occupied by the same stream that is now trying to capture it. This is a recursive capture. In this case, for simplicity, we write to the displaced header in our NULL stack, because we previously captured this monitor and already have information about it in our stack.

The third case, the “bad” inflate : So, all attempts to be cunning with the collapse failed, and we have nothing to do but to sulk, to show the vile user the language and fall off the segfault. Haha Joke. However, I didn’t just say the same about inflating: in the case when you have to act “honestly”, resorting to OS-level primitives, they say that monitor was inflated . In the code, this behavior is described in the ObjectSynchronizer::inflate , where we will not look particularly carefully: in fact, the method is thread safe and taking into account some technical subtleties sets the monitor a flag that it is bloated.

After inflating the monitor it is necessary to enter it. The ObjectMonitor::enter method does just that, it uses every conceivable and inconceivable trick to avoid parking the stream. These tricks include, as you might have guessed, attempts to capture using a spin loop, using one-off CAS and other “free methods”. By the way, it seems I have found a slight discrepancy between comments and what is happening. once we try to enter the spin loop monitor, claiming that we do it only once:

 352 353 354 355 356 357 358 359 360 361 362 363

 // Try one round of spinning *before* enqueueing Self // and before going through the awkward and expensive state // transitions. The following spin is strictly optional ... // Note that if we acquire the monitor from an initial spin // we forgo posting JVMTI events and firing DTRACE probes. if (Knob_SpinEarly && TrySpin (Self) > 0) { assert (_owner == Self , "invariant") ; assert (_recursions == 0 , "invariant") ; assert (((oop)(object()))->mark() == markOopDesc::encode(this), "invariant") ; Self->_Stalled = 0 ; return ; }

But a little further, in the called enterI method, enterI do it again, again speaking of only once:

 485 486 487 488 489 490 491 492 493 494 495 496 497

 // We try one round of spinning *before* enqueueing Self. // // If the _owner is ready but OFFPROC we could use a YieldTo() // operation to donate the remainder of this thread's quantum // to the owner. This has subtle but beneficial affinity // effects. if (TrySpin (Self) > 0) { assert (_owner == Self , "invariant") ; assert (_succ != Self , "invariant") ; assert (_Responsible != Self , "invariant") ; return ; }

Hmm, parking at the operating system level is so scary that we are ready for almost everything to avoid it. Let's see what is so terrible in her.

Parking flows in reverse

I must note that we have now come to the code that was written a long time ago, and this is noticeable. There are a lot of duplication, reengineering and other amenities. However, the presence of comments like “remove this crutch” and “combine these with that” is slightly reassuring.

So what is parking streams? Everyone probably heard that each monitor has a so-called Entry List (not to be confused with the Waitset ) So: he really does have it, although it is actually a queue. After all the failed attempts to enter the monitor cheaply, we add ourselves to this queue, after which we park:

 591 592 593 594 595 596 597 598 599 600 601

 // park self if (_Responsible == Self || (SyncFlags & 1)) { TEVENT (Inflated enter - park TIMED) ; Self->_ParkEvent->park ((jlong) RecheckInterval) ; // Increase the RecheckInterval, but clamp the value. RecheckInterval *= 8 ; if (RecheckInterval > 1000) RecheckInterval = 1000 ; } else { TEVENT (Inflated enter - park UNTIMED) ; Self->_ParkEvent->park() ; }

Before going directly to the parking lot, note that it may be timed or not timed, depending on whether the current thread is responsible. Responsible flows are always no more than one, and they are needed in order to avoid the so-called stranding: a sadness when the monitor is free, but all the threads in the wait set are still parked and waiting for a miracle. When there is a person in charge, he automatically wakes up from time to time (the more time a futile wakeup occurs , an awakening, after which the lock cannot be captured - the longer the parking time. Note that it does not exceed 1000 ms) and tries to enter the monitor. The rest of the streams can wait for awakening for even ages.

Now it is time to go to the very essence of parking. As you already understood, this is something semantically similar to wait/notify familiar to every java-developer, but it happens at the operating system level. For example, in linux and bsd, as one would expect, POSIX threads are used, in which pthread_cond_timedwait (or pthread_cond_wait ) are called to wait for the release of the monitor. These methods change the status of the Linux flow to WAITING and ask the system sheduler to wake them up when some event occurs (but not later than after a certain period of time, if the flow is responsible).

Thread scheduling at OS level

Well, it's time to get into the linux kernel and see how the scheduler works there. The linux sources are known to be in git, and you can clone the scheduler like this:

 git clone git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-rt.git

Let's go to the folder ... Okay, okay, just kidding. The features of a sheduler in Linux up to the lines in the source code are already too deep for a harmless article, which, let me remind, began with a simple synchronized block :) In order to reduce hardcore, I will tell you how sheaders work in general; both in linux and in alternative systems. - , , : — , . — quantum — , , , . linux , 10-200 , 1 . windows , 2-15 , — 10 15 .

, , , . , , (, - I/O-), , - . - . , , , -, , .

, , , — . , , , , , , .

, , : , — , . : , , -, .

, , , . , contention, .

monitorexit

biased locking , -, . , displaced header NULL, . : biased lock IllegalMonitorStateException ( -).

unbiased locking InterpreterRuntime::monitorexit . (, : IllegalMonitorStateException ) ObjectSynchronizer::slow_exit , , fast_exit . , fast path . : stack-locked , inflating inflated . : , , . , - .

, , - , . , , , TrySpin (. ). , . , , , .

, , ( Knob_QMode . , , , 0 , . , , , , , ), , . , , , . - os::PlatformEvent::unpark() , . , linux bsd pthread_cond_signal .

, , , .

NB: `synchronized` -

java- :

 synchronized void doSomething() { // Do something }

- :

 synchronized void doSomething(); Code: 0: return

bytecodeInterpreter.cpp synchronized - , 767 . if (METHOD->is_synchronized()) . if -, biased locking. , monitorenter . , , , ( CAS) biased -.

, , synchronized.

Wait notify

synchronizer.cpp , 377 . wait/notify inflated, , — inflate' . wait notify .

wait set ( ) , ( , ; - notify).

Notify wait set , , - , . NotifyAll , wait set .

Memory effects

, JMM, , happens-before . thin CAS-; inflated OrderAccess::fence(); . biased, , : program order . revoke HB monitorexit, , ( thin, inflated), enter ( thin, inflated).

wait fence, HB.

. aka Disclaimer

, , . , JIT . . , , « hotspot» , .

Stay tuned

, , memory barriers , happens-before JMM. volatile , . final- , TheShade cheremin , ( ). , , PrintAssembly , , JIT.

: 144f8a1a43cb jdk7u. , — ..

Biased locking , BiasedLockingStartupDelay (4000 ). , , safepoints, revoke bias .

safepoint ObjectSynchronizer::deflate_idle_monitors , , .

TheShade , artyushov , cheremin AlexeyTokar , (|) , , - , .

Source: https://habr.com/ru/post/143237/

All Articles

But how does multithreading work? Part I: Synchronization

Theoretical minimum

Object Headers

Content mark words

A simple example of using `synchronized`

monitorenter

fast_enter

slow_enter

Parking flows in reverse

Thread scheduling at OS level

monitorexit

NB: `synchronized` -

Wait notify

Memory effects

. aka Disclaimer

Stay tuned

And one more thing ©

More articles:

But how does multithreading work? Part I: Synchronization

Theoretical minimum

Object Headers

Content mark words

A simple example of using synchronized

monitorenter

fast_enter

slow_enter

Parking flows in reverse

Thread scheduling at OS level

monitorexit

NB: synchronized -

Wait notify

Memory effects

. aka Disclaimer

Stay tuned

And one more thing ©

More articles:

A simple example of using `synchronized`

NB: `synchronized` -