Pictured: Blue Gene/P at Argonne National Laboratory

Parallel programming has existed as a discipline for a long time, and nearly every computer today has more than one core, yet I would not call the progress of recent years "revolutionary" or even "evolutionary". Multithreading still means a great many problems, and one of the most striking is the difficulty of synchronizing threads once the number of parallel processes gets large.
At IBM, as you might have guessed, they are looking for a solution to this problem (an acquaintance of mine jokingly described it as being "prisoners of locking"), since Big Blue remains the largest designer, manufacturer and supplier of supercomputer clusters, where synchronized memory access is a real stumbling block.
There is some good news: the processors of the next Blue Gene/Q model, which will power Sequoia, the 20-petaflops supercomputer the company is currently building for Lawrence Livermore National Laboratory, support transactional memory not in software but in hardware. If testing goes well, this technology will show that scalable parallel programming can be a simple task (even without specially crafted parallel algorithms), and that, in turn, would change the computing landscape. Since most research to date has been carried out on software implementations (STM), the BlueGene/Q chips will make it possible to realistically measure the difference in speed between two fundamentally different approaches: HTM (hardware transactional memory) and traditional STM.
BlueGene/Q is a multi-core, 64-bit system-on-a-chip built on PowerPC technology (to be precise, on the four-way multithreaded PowerPC A2 architecture). Each chip contains 18 cores and close to one and a half billion (1.47 billion) transistors. Sixteen cores do the actual computation, the operating system runs on another, and the last core is responsible for the reliability of the whole system. At a clock frequency of 1.6 GHz, each chip delivers 204.8 Gflops while drawing 55 watts. Naturally, the memory and I/O controllers are also part of the chip. Each BlueGene/Q core also includes four floating-point units, which gives four completed floating-point operations per clock per core.
Why 18 cores? The answer is redundancy. If a failure is detected in one of the cores, it can be disabled and its work moved to the spare. In fact, detecting and remapping a faulty core can happen at any stage of production or assembly, not only when the finished chip is being tested, but also later, for example when the chip is installed into a computing cluster. In the case of Sequoia, about 100,000 chips will be used to reach the coveted 20 petaflops. With that many processors, remapping cores becomes very important: IBM has calculated that with 100,000 chips in a supercomputer, on average one processor will fail every three weeks.

Traditional multithreading: locks and deadlocks
Software transactional memory is an approach to parallel programming that should be familiar to any self-respecting developer. Potentially, the technology is meant to make parallel computing more efficient, but more importantly, to simplify the programming model itself. Parallel programming is relatively simple when a task can be split into independent threads that share no data: each thread runs on its own core and no coordination between cores is needed. Things get much more involved when the tasks are not independent, for example when different threads need to update a single shared value, which is a very common situation.
The standard solution to this problem is locking. Every time a thread needs to update a shared value, it takes a lock on it. No other thread can access the locked value until the current thread finishes; it simply has to wait. The thread that holds the lock can modify the value (here is where "complex calculations" and "busy, please hold" take the stage) and then release the lock. Other threads waiting on the lock can then continue their calculations with the updated or released value. This model works in practice, but it has its problems: if accesses to the shared value are rare and irregular, the model is efficient. But when the shared value is read or updated regularly, let alone constantly, threads spend most of their time waiting instead of computing, and the performance of the whole system, which was supposed to be parallel, collapses.
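To make the scheme concrete, here is a minimal C++ sketch of my own (not IBM code): four threads increment one shared counter, and a mutex forces each of them to wait its turn before touching it.

```cpp
// A minimal sketch of lock-based sharing: several threads update one counter,
// and a mutex serializes access so no updates are lost.
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int shared_total = 0;      // the "shared value" from the text
std::mutex total_lock;     // every thread must take this lock before touching it

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> guard(total_lock);  // wait here if someone else holds the lock
        ++shared_total;                                 // safe only while the lock is held
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(worker, 100000);
    for (auto& th : threads)
        th.join();
    std::cout << shared_total << '\n';  // always 400000; without the lock it usually is not
    return 0;
}
```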
Besides, locks are quite difficult to manage. A single shared value is easy enough to handle, but real programs are usually far more complicated. A program with two locks, call them A and B, can get into a state called "deadlock". If two threads each need both locks, they have a choice: lock A first and then B, or the other way around. As long as both threads take the locks in the same order, there is no problem. But if one thread locks A first while the other starts with B, both can get stuck forever: the first thread waits for B to be released while the second waits for A.
On paper this looks easy to avoid, and it is, as long as the program has only two locks. But as soon as the functionality grows, you have to be extremely careful that every part of the software does its job exactly "as it should". Add a couple more locks and the whole scheme becomes several times more complicated; add ten of them, and you have practically locked yourself into a deadlock.
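A sketch of that two-lock scenario, again purely illustrative. Run it enough times and it can hang forever, which is exactly the point.

```cpp
// Two locks, A and B, taken in opposite orders by two threads: the deadlock
// described above. The program may finish, or it may hang indefinitely.
#include <mutex>
#include <thread>

std::mutex lock_a;
std::mutex lock_b;

void thread_one() {
    std::lock_guard<std::mutex> a(lock_a);   // holds A...
    std::lock_guard<std::mutex> b(lock_b);   // ...then waits for B
}

void thread_two() {
    std::lock_guard<std::mutex> b(lock_b);   // holds B...
    std::lock_guard<std::mutex> a(lock_a);   // ...then waits for A -- possible deadlock
}

int main() {
    std::thread t1(thread_one);
    std::thread t2(thread_two);
    t1.join();
    t2.join();
    // Using std::scoped_lock(lock_a, lock_b) in both threads would avoid the
    // problem by acquiring both locks in a consistent, deadlock-free way.
    return 0;
}
```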
True Transactional Memory - End of Locks
In general, the hardware transactional memory now being developed by IBM is designed to solve the difficulties described above. The developer only needs to mark the parts of the code that manipulate shared values as "atomic". Each atomic block executes inside a transaction: either in its entirety or not at all. Inside an atomic block, the program can read shared values without any locks, perform the necessary calculations, and write the result. At the end, the block simply commits the transaction. Here is the tricky part of this whole computational kitchen: at commit time, the transactional memory checks whether the shared values have changed since the atomic block started. If they have, the transaction is aborted and all the work done is rolled back to the starting point. For the program, that simply means repeating the operation with the changed values.
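To give a feel for the programming model, here is what an atomic block looks like in source code using GCC's experimental transactional-memory extension (-fgnu-tm). This is only a sketch of the idea, not BlueGene/Q's actual programming interface, and the account balances are made up for the example.

```cpp
// An "atomic block" expressed with GCC's experimental transactional-memory
// extension. Compile with: g++ -fgnu-tm -pthread example.cpp
#include <thread>
#include <vector>

int balance_a = 100;
int balance_b = 0;

void transfer(int amount) {
    // Everything inside the block runs as one transaction: other threads see
    // either both updates or neither, and the runtime retries on conflict.
    __transaction_atomic {
        balance_a -= amount;
        balance_b += amount;
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(transfer, 5);
    for (auto& th : threads)
        th.join();
    // balance_a + balance_b is always 100, with no explicit locks in sight.
    return 0;
}
```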
Transactional memory can potentially make programmers' lives much easier by offering several improvements over the traditional locking scheme. First, it is "optimistic": instead of each thread locking values just because a conflicting operation might occur (the pessimistic approach), it assumes that every operation in every thread will complete successfully (the optimistic approach). If a conflict does occur, the only thing a thread has to do is repeat the calculation. Second, there is no deadlock scenario, because there are no locks at all. Third, the programming model behind hardware transactional memory is, broadly speaking, already familiar to most developers: transactions and rollbacks are well known to anyone who has worked with relational databases. Fourth, atomic blocks should, in theory, simplify the building of large and complex programs: an atomic block with nested atomic blocks still does its job correctly, which is not true of programs composed from locks.
(However, it is worth interrupting my own enthusiastic narration to note that transactional memory has bottlenecks of its own. What if a transaction needs to perform an operation that cannot be rolled back, for example sending data over the network or drawing on the screen? The latter is a standard limitation of transactional memory in general. The best ways to work around these difficulties are still being developed, and not just at IBM.)

Hardware benefits
So far, research into transactional memory has focused on software implementations of the technology. Real (silicon) processors still do not truly support it, they only emulate it. Some schemes make virtual machines run in a transactional-memory mode (STM modifications for .NET and Java); others use native code and require developers to go through special functions to access shared data. Why, then, do we need a hardware implementation? The fact is that any software implementation of transactional memory is slow, sometimes monstrously slow, and that is not exactly what one expects from parallel computing.
BlueGene/Q differs from other processors in that it carries a transactional-memory module on board. It is the first commercial processor to do so (although Sun's Rock, cancelled when Oracle bought the company, was also supposed to include a transactional-memory module). And although complete information about the chip has not yet been released, some details are already known.
Data in the cache carries a version tag, and the cache can store several versions of the same data. The program tells the processor to start a transaction, does its work, and then tells the processor to commit the transaction. If other threads have operated on the shared values in the meantime, creating newer versions, the cache rejects the transaction and the thread retries it from the beginning. If no newer versions were created, the data is committed.
The chip also introduces "speculative execution": instead of waiting for the latest versions of all values (which in practice may mean waiting for another core to finish its number-crunching), a thread can attempt the operation with the data it already has, speculatively doing useful work without knowing whether it will stand. Once the work is done, the system checks the versions of the values; if they were the latest (that is, no changes were made to the shared values during the operation), the result is written to memory. This should improve performance, because useful computation happens before the final values are even available, which sounds almost like science fiction. If the shared values the thread used turn out to be stale, the work is cancelled and redone with the updated values.
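As a rough software caricature of this optimistic scheme (my own illustration; the real mechanism lives in BlueGene/Q's cache, not in code): take a snapshot, do the work speculatively, then commit only if the shared value still matches the snapshot, otherwise throw the work away and redo it with the fresh value.

```cpp
// Optimistic read-compute-commit with retry, mimicking the versioned-cache
// idea in ordinary C++ atomics.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> shared_value{0};

void speculative_add(int delta, int repetitions) {
    for (int i = 0; i < repetitions; ++i) {
        int snapshot = shared_value.load();   // the "version" we start from
        int result = snapshot + delta;        // speculative work on that snapshot
        // "Commit" succeeds only if nobody changed the value in the meantime;
        // on failure, snapshot is refreshed and the work is redone.
        while (!shared_value.compare_exchange_weak(snapshot, result)) {
            result = snapshot + delta;        // redo with the updated value
        }
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(speculative_add, 1, 100000);
    for (auto& th : threads)
        th.join();
    std::cout << shared_value << '\n';        // always 400000, with no locks
    return 0;
}
```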
Logical evolution
In principle, processor support for transactional memory is a logical extension of a feature that has long been present in PowerPC processors: "load-link/store-conditional", or LL/SC. This is a primitive operation that can be used as a building block for safe multithreaded structures. LL/SC underpins well-known mechanisms such as locks, as well as much more exotic data structures, such as tables that several threads can modify simultaneously without any locking. STM (a software implementation of transactional memory) can also be built on top of LL/SC.
LL/SC consists of two parts, the first being the "load-link". The program uses LL to fetch a value from memory, after which it can perform whatever manipulations it needs on that value. When the operation is complete and the result has to be written back to memory, the second part, the "store-conditional", is used. SC succeeds only if the value in memory has not been changed since the LL. If the value has changed, everything starts over from the beginning.
LL/SC can be found on many modern processors: PowerPC, MIPS, ARM and Alpha all have it. The x86 architecture offers an alternative called "compare and swap". However, most LL/SC implementations are limited: for example, they cannot track changes to individual bytes of memory, only whole cache lines. This means an SC can fail even though the value itself was never actually changed.
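Here is the canonical LL/SC-style retry loop expressed through portable C++11 atomics; the multiply operation is arbitrary, chosen only to show that any read-modify-write can be built from this primitive. On PowerPC, compare_exchange_weak typically compiles down to the lwarx/stwcx. load-linked/store-conditional pair and, like raw LL/SC, is allowed to fail spuriously, hence the loop.

```cpp
// An atomic "fetch and multiply" built from a compare-and-swap/LL-SC retry loop.
#include <atomic>

int atomic_fetch_multiply(std::atomic<int>& value, int factor) {
    int observed = value.load();                  // the "load-link" part
    // The "store-conditional" part: the write succeeds only if the value is
    // still the one we observed; otherwise observed is reloaded and we retry.
    while (!value.compare_exchange_weak(observed, observed * factor)) {
        // retry with the freshly observed value
    }
    return observed;                              // the value before the update
}
```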
The transactional-memory module built into BlueGene/Q is, in essence, a hardware LL/SC after a course of steroids: within a transaction, each thread can perform the LL part of the operation on several locations at once, and likewise the SC part can write results to several locations. This does not remove the chance of a conflict, but it does reduce the likelihood of one.
Give two!
In general, getting hardware transactional memory to work is not an easy story. Ruud Haring, who presented the chip for IBM at the recent Hot Chips conference, said it "took a lot of clever steps" to make such a system work, and called the result a work of genius. After the preliminary chip-mapping work was completed, the design was run on FPGAs (processors that can be reconfigured in software) and, remarkably, everything worked. But as clever as the system is, it has just as many limitations. Funny as it may be, the current prototype of transactional memory does not support transactions spanning multiple processors (everything described above, squared). This is not a problem for Sequoia, which will run fairly specific workloads, but it will certainly become an obstacle to popularizing HTM on conventional multiprocessor machines: threads running on different processors will modify shared values that the transactional memory simply cannot "see".
Still, the main achievement of BlueGene/Q and its hardware transactional memory is that IBM managed to implement the technology with minimal, or even zero, performance loss. That is what makes me believe that one day HTM will appear in the processors each of us uses and will work efficiently in the applications we run every day. Until then, we have no way of knowing whether this kind of parallelism can really improve the lives of both developers and users.
To sum up: hardware transactional memory is by no means a panacea, nor is it a mechanism for increasing raw processor performance. Rather, it is a tool meant to make life easier for programmers writing parallel programs of any complexity. Its first trial in the Sequoia cluster will be a good test of how practical transactional memory is in real life, and not only for simulating nuclear reactions or weather patterns. If the test succeeds, there is little doubt that the technology will find its way into commercial devices of the widest profile, since multi-core systems are already present even in phones. If it fails, then multi-core computing, until recently considered the "cure" for stalling performance and for the natural limits of chasing Moore's law, will have received, as far as I am concerned, a fatal stab in the back.