
Definition
Parallel programming is difficult. On systems with shared memory, there is no getting around synchronizing the access of parallel processes and threads to the shared resource (memory). The tools for this are:
- locks (mutex);
- non-blocking algorithms (lockless, lock-free);
- transactional memory.
Transactional memory is a technology for synchronizing concurrent threads. It simplifies parallel programming by grouping instructions into atomic transactions. Concurrent threads run in parallel [1] until they begin to modify the same memory. For example, operations that add nodes to a red-black tree (the animation in the header) can run in parallel in several threads.
```c
void move(list *from, list *to)
{
    __transaction_atomic {
        node *n = pop(from);
        push(to, n);
    }
}
```
The approach to concurrency management taken by transactional memory is called optimistic: we assume that threads mostly work independently of each other and only rarely change the same data, so most transactions complete successfully. Lock-based approaches, by contrast, are pessimistic: we assume that threads will always conflict, and we always forbid them to be in the critical section at the same time.
If a data conflict does occur, the transaction is canceled. Canceling a transaction rolls back all the actions the thread performed during it. After that, the transaction is usually restarted, or else a function designated in advance as an “emergency exit” is called, most often falling back to locks.
Advantages of transactional memory:
- relative ease of use (wrapping whole methods in a transaction block);
- complete absence of locks, and therefore of deadlocks;
- increased concurrency and, consequently, performance.
Transactional memory is not a silver bullet. There are also disadvantages:
- improper use can lead to poor performance and incorrect behavior;
- limited applicability: operations that cannot be undone must not be performed inside a transaction;
- debugging complexity: a breakpoint cannot be set inside a transaction.
The birth of technology
Transactions have existed in databases for decades, but the idea of carrying them over from the database world to parallel programming appeared only in the 1980s. The technology was developed and popularized by Maurice Herlihy, Ravi Rajwar, and Nir Shavit. The first studies proposed hardware implementations of transactional memory, which would not see the light of day for nearly another 30 years. In the 1990s the first software implementations of the technology appeared; hardware implementations followed in the 2000s.
Software implementations (Software Transactional Memory, STM)
Among the many implementations of software transactional memory, I would like to highlight four. Examples are available on GitHub: JIghtuse/tm-experiments.
Clojure
Clojure is the only language whose core supports transactional memory. The main STM constructs are ref (a reference to data, through which the data can be changed only in a transaction) and dosync (a transaction block).

Clojure's STM uses MultiVersion Concurrency Control (MVCC): it stores multiple logical versions of the data used in a transaction. During a transaction, a thread observes a snapshot of the data as of the moment the transaction started.
Diagram of ref versions in a Clojure transaction:
Transactions 1 and 2 begin at the same time, each receiving a copy of ref at version v0. Inside the transactions the data is processed, changing the value of ref. The first transaction finishes earlier and wins the race to update ref with the new value. The second transaction then finishes, and its attempt to update ref fails (the red arrow in the diagram), because the version of ref is no longer the one it expected. The transaction is restarted, this time receiving a copy of the new version of ref; since no other transaction tries to change ref, transaction 2 completes successfully on the second attempt.

The calculation in transaction 3 does not change the value of ref, so no restart occurs and it completes successfully.
Consider an example of transferring funds between bank accounts:
The program runs in a single thread, but the code is thread-safe.

```clojure
(def account1 (ref 100))
(def account2 (ref 0))

;; a ref is read with (deref refname):
(deref account1)   ; => 100

;; @refname is shorthand for (deref refname)
@account2          ; => 0

(defn transfer [amount from to]
  (dosync
    (alter from - amount)
    (alter to + amount)))

(transfer 100 account1 account2)  ; => 100
```
There are concurrency patterns in which the runtime may be allowed to “relax” in exchange for extra performance. For example, imagine keeping a log of the day's deposits. The order of entries in the log does not matter as long as you know the final balance will be correct: if you receive two deposits of $100 and $50, the order in which they are recorded is irrelevant. The contributions of the two transactions are commutative, and Clojure provides the commute operation for exactly such cases.
```clojure
(defn log-deposit [account amount]
  (dosync
    (println "Depositing $" amount " into account, balance now: "
             (commute account + amount))))

(def myaccount (ref 0))

(log-deposit myaccount 100)
(log-deposit myaccount 50)
```
Haskell
Haskell's transactional memory lives in the stm library, which is part of the Haskell Platform. Incorrect use of transactional types is caught when the program is compiled.
Haskell has a powerful type system that separates functions with side effects from pure functions. Any value of type IO t is called an action. To perform an action atomically in Haskell, it is preceded by the word atomically. Calling atomically with an action as its argument gives two guarantees:
- atomicity: the effect of atomically act becomes visible to other threads all at once and in full; no other thread can see an intermediate state inside the atomic action, only the final state;
- isolation: during a call to atomically act, the action act is completely unaffected by other threads; it takes a snapshot of the state of the world when it starts and runs against that snapshot.
atomically has the type atomically :: STM a -> IO a. Its argument is an action of type STM a. Like an IO action, an STM action has side effects, but their range is much narrower. The main side effects of an STM action are reading and writing a transactional variable of type TVar a:

```haskell
readTVar  :: TVar a -> STM a
writeTVar :: TVar a -> a -> STM ()
```
Consider an example of transferring funds between bank accounts.

```haskell
import System.IO
import Control.Concurrent.STM
```
Scala
The STM implementation for Scala (ScalaSTM) was developed under the influence of the Haskell and Clojure implementations. Besides Scala, ScalaSTM can be called from Java and Clojure. The implementation is used in the popular Akka framework for writing parallel programs.
ScalaSTM provides a Ref cell, which may be modified only inside a transaction. Data structures built from immutable objects and Refs can be used from many threads.
Consider an implementation of a thread-safe doubly linked list using transactional memory. Unfortunately, I did not manage to build the Scala example; I leave that exercise to the reader.
Full example on GitHub: github.com/nbronson/scala-stm/blob/master/src/test/scala/scala/concurrent/stm/examples/ConcurrentIntList.scala

In the shared structure, the pointers to the next and previous nodes are what make thread safety hard. If one thread may write a variable at the same time another accesses it (for reading or writing), use a Ref. Define the class for a list node and initialize the head of the list. The list is circular: when the head is created, its pointers point to itself, so they are never null.
```scala
import scala.concurrent.stm._

class ConcurrentIntList {
  private class Node(val elem: Int, prev0: Node, next0: Node) {
    val isHeader = prev0 == null
    val prev = Ref(if (isHeader) this else prev0)
    val next = Ref(if (isHeader) this else next0)
  }

  private val header = new Node(-1, null, null)
```
If x is a Ref, then x() reads the value stored in x, and x() = v sets it to the value v. A Ref may be read and written only inside an atomic block (a transaction), and this is checked at compile time. The following demonstrates the use of a transaction:
```scala
  def addLast(elem: Int) {
    atomic { implicit txn =>
      val p = header.prev()
      val newNode = new Node(elem, p, header)
      p.next() = newNode
      header.prev() = newNode
    }
  }
```
Scala combines the ideas of many programming paradigms, so it is no surprise that it has original technologies in the area of transactional memory as well. The aforementioned Akka framework combines the capabilities of actors and transactional memory, producing agents and transactors that will finally blow your mind.
C/C++ (GCC 4.7+)
Starting with version 4.7, GCC supports transactional memory. The implementation is the libitm runtime library; the -fgnu-tm flag (plus -mrtm or -mhle for hardware support) is passed at compilation. The library was developed with an eye on the draft of transactional constructs for C++ (their inclusion in the language standard has been proposed).

Most hardware implementations of transactional memory are best-effort, so practical implementations combine hardware techniques with software transactional memory. Such systems are called “hybrid transactional memory” systems, and the GCC implementation is among them.
New keywords are introduced into the language:
- __transaction_atomic { … } marks a code block as a transaction;
- __transaction_relaxed { … } declares that the unsafe code inside the block produces no side effects;
- __transaction_cancel explicitly cancels a transaction;
- __attribute__((transaction_safe)) marks a transaction-safe function;
- __attribute__((transaction_pure)) marks a function without side effects.
To demonstrate the use of transactional memory in C, we will fill a histogram from concurrent threads.
Full implementation on GitHub: github.com/JIghtuse/tm-experiments/blob/master/histogram/src/hist.c

One transaction block is used per histogram-update loop. The runtime library (or the hardware) determines when and which transactions to restart.
```c
#ifdef _USE_TSX
    __transaction_atomic {
#endif
    for (i = 0; i < d->sz; ++i) {
        struct pixel p = d->pixels[i];
        unsigned int luminance = rY * p.red + gY * p.green + bY * p.blue;
#if defined _USE_TSX
        ++histogram[luminance / BORDER];
#elif defined _USE_MUTEX
        pthread_mutex_lock(&mut);
        ++histogram[luminance / BORDER];
        pthread_mutex_unlock(&mut);
#endif
    }
#ifdef _USE_TSX
    }
#endif
```
Hardware implementations (Hardware Transactional Memory, HTM)
Hardware implementations of transactional memory have only recently begun to appear.
Sun Rock (SPARC v9) 2007-2008

The Rock microprocessor from Sun Microsystems was the first microprocessor with hardware support for transactional memory. The 64-bit SPARC v9 processor had 16 cores.
In 2007, the company announced support for the technology. Two new instructions, chkpt and commit, and one new status register, cps, were added for transactional memory. The instruction chkpt <fail_pc> started a transaction and commit ended it. When a transaction was canceled, control jumped to <fail_pc>, where the cps register could be checked to determine the reason for the cancellation. Transactions were aborted on data conflicts, TLB misses, interrupts, and complex instructions. Even so, in many cases transactional memory on the Rock processor gave an advantage in synchronization.
In 2008, Sun engineers unveiled the transactional memory interface and the Adaptive Transactional Memory Test Platform simulator. After Oracle Corporation acquired Sun, the Rock project was closed: “This processor had two amazing properties: it was incredibly slow and consumed an enormous amount of energy. It was so hot that they had to put 12 inches of cooling fans on top so that the processor would not overheat. It would be folly to continue this project.” [2]
IBM BlueGene/Q (PowerPC A2) 2011

The BlueGene/Q supercomputer Sequoia became the first commercial system with hardware support for transactional memory. The technology works in the 32 MB level-2 cache of the PowerPC A2 processor (PowerPC BQC 16C). Data in the cache is tagged with a version, and the cache can store multiple versions of the same data.
The program informs the processor that a transaction is starting, does its work, and reports that the transaction should be committed. If other threads have changed the same data, creating new versions, the cache rejects the transaction and the thread tries it again. If no other versions of the data have appeared, the data is modified and the transaction completes successfully.

The versioning technology in PowerPC A2 is also used for speculative execution. Instead of waiting for a new version of the data, a thread can compute with the data it has, speculatively doing useful work. If the data turns out to be the same after the update, the thread commits its work from the cache; performance is higher, because the thread did the work while the final value was still being computed. Otherwise, the speculative results are discarded and the thread recomputes with the correct values.

Transactional memory support is in some ways a logical extension of a feature long present in PowerPC processors: load-link/store-conditional, or LL/SC. LL/SC is a primitive operation that can serve as a building block for all thread-safe structures. The first part, load-link, is used by a program to fetch data from memory. The thread then modifies the data and writes it back with store-conditional, which succeeds only if the data has not changed in the meantime. Otherwise, the program repeats the whole sequence from the beginning.

Transactional memory is LL/SC on steroids: a thread performs load-link operations on many different memory locations, and the transaction-commit operation either atomically updates all of them or cancels the transaction (like a failed store-conditional).
Ruud Haring, who presented IBM's work on transactional memory at Hot Chips, admitted that the company faced many nontrivial problems in implementing it. For all its sophistication, the implementation has limitations: it does not support transactions across processors. This is not an issue for Sequoia, and it still makes it possible to estimate the gains from using transactional memory.
IBM zEnterprise EC12 (z/Architecture) 2012
The first commercial server to support transactional memory instructions. Using transactional memory, IBM measured a 45% performance increase over the z196 machine on the DB2 database and on virtualized server workloads.
IBM Power8 (Power) 2013

The Power8 memory controllers support transactional memory. Support for the technology appeared in the Linux kernel in version 3.9.
Not much information on HTM in Power8 has been published; I hope more appears.
Intel Haswell (x86) 2013

In 2012, Intel announced the Transactional Synchronization Extensions (TSX) for the x86 instruction set. The extensions allow programmers to mark selected sections of code as transactions.
In 2013, with the release of the Haswell processor generation, hardware support for transactional memory became available at the consumer level for the first time.
Haswell manages read and write sets at cache-line granularity, tracking addresses in the L1 data cache. Conflicts are detected through the cache-coherence protocol, which is natural: detecting transaction conflicts and keeping caches coherent are closely related problems. If a value has changed in one thread but not in the others, something is out of sync.
TSX consists of two parts. Hardware Lock Elision (HLE) provides a simple way to convert lock-based programs into transactional ones, and the resulting programs remain backward compatible with existing processors. Restricted Transactional Memory (RTM) is a more complete implementation of transactional memory.
Hardware Lock Elision
HLE slightly modifies the instructions that change a memory location. The technology adds prefixes (instructions that perform no action themselves but change the interpretation of the following instruction) to the instructions that acquire and release a lock: XACQUIRE and XRELEASE, respectively.
Between taking and releasing the lock, the processor tracks the memory regions that the threads read and write. In the event of a conflict (two threads modify the same data, or one thread reads data that another writes) the processor cancels the transaction when the lock is released. Otherwise, execution continues normally.
Thus, HLE allows conventional lock-based code to run optimistically. Each thread assumes it has acquired the lock, while other threads can operate at the same time. As long as this is safe, transactions are not canceled.
The technology is backward compatible with processors without HTM support. The lock-management operations remain, only with a special prefix. Haswell processors honor the prefix and use transactional execution instead of manipulating the lock; any other processor ignores the prefix and simply manages the lock with traditional lock-based behavior. The XACQUIRE and XRELEASE instructions already exist, but carry no meaning until used with specific instructions.
HLE makes it possible to write programs and operating systems that use Haswell transactions to increase non-blocking concurrency, while the code still runs correctly on current processors. As a result, introducing the new feature is easy and safe.
HLE by a simple example
Operating systems implement kernel locks as chunks of memory, typically a machine-word-sized integer. To take a lock, a thread does something to this memory location, for example incrementing the value from 0 to 1; the reverse operation (decrementing from 1 to 0) releases it. The changes are visible to every core and every thread in the system, so other threads immediately see that they cannot take the lock and must wait for it to be released (for the value to become 0).
With the XACQUIRE/XRELEASE prefixes, the attempt to take the lock succeeds, and the thread proceeds as if the lock belongs to it. However, the write to the lock variable is elided: other threads still read the value as 0, so they too see the lock as free and can proceed, while the processor tracks everyone's reads and writes. If there is no data conflict, all the threads commit their work when they release the lock. If a conflict occurs, the transactions are canceled and re-executed, this time actually taking the lock.
Hence the name of HLE: the write that would change the lock from 0 to 1 is elided, and the value stays 0. The lock is “elided”.
Restricted Transactional Memory
RTM requires more of the programmer: it gives up backward compatibility and introduces three new instructions. Where HLE uses transactions implicitly, allowing lock-based code to run in parallel, RTM makes the start, end, and abort of transactions explicit. A thread starts a transaction with the XBEGIN instruction, supplying a fallback routine that runs if the transaction is aborted. The transaction is completed with the XEND instruction, at which point the processor commits it if there were no conflicts, or aborts it and jumps to the fallback routine. A transaction can also be aborted explicitly in the program with the XABORT instruction. Thanks to the explicit boundaries and the escape route, RTM allows transactions to be controlled more fully than HLE does; in the long term, RTM should simplify the implementation of transactional capabilities.
Conclusions
Transactional memory is a viable alternative to existing synchronization methods and simplifies parallel programming.
For inaccuracies in translation, style, or typos, please write me a personal message.
Notes
- [1] On the difference between “concurrency” and “parallelism”, Rob Pike put it well in his talk “Concurrency is not Parallelism”: “Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once.”
- [2] Yes, in addition to a raft of excellent software products (OpenSolaris, MySQL, OpenOffice), Oracle abandoned a promising hardware technology.
Sources