
Flow Graphs, Speculative Locks, and Task Arenas in Intel® Threading Building Blocks (continued)

This post is the second half of a translation of the article “Flow Graphs, Speculative Locks, and Task Arenas in Intel Threading Building Blocks” from Parallel Universe Magazine, Issue 18, 2014. In this half we look at speculative locks, which take advantage of Intel Transactional Synchronization Extensions technology, and user-managed task arenas, which provide additional control over the level of concurrency and over task isolation. If you are interested, read on.

Speculative Locks


Intel TBB 4.2 introduces speculative locks: new synchronization classes based on Intel Transactional Synchronization Extensions (Intel TSX) technology.
Speculative locks allow critical sections protected by the same lock to run concurrently, on the assumption that data accesses and modifications do not conflict with each other. If a conflict over the data does occur, one or more speculative executions must be rolled back without touching the protected data and, thus, without affecting other threads. The threads involved in the conflict then re-execute their critical sections and may take the real lock to protect the data (after all, Intel TSX does not guarantee that a speculative execution will ever complete successfully).
In the Intel TBB implementation of speculative locks [6, 7], all these steps happen transparently, and the programmer can simply use the usual mutex API. Moreover, on a processor that does not support Intel TSX, the implementation falls back to a regular lock right away. That is, developers can write portable programs that take advantage of transactional synchronization where it is available.
The Intel TBB library currently provides two mutex classes that support Intel TSX: speculative_spin_mutex and speculative_spin_rw_mutex; the latter was added as a preview feature in Intel TBB 4.2 Update 2.
The speculative_spin_mutex class is very similar to spin_mutex; both live in the same tbb/spin_mutex.h header file. The main difference of speculative_spin_mutex, aside from Intel TSX support, is its size. To avoid sharing a cache line with other data, which would very likely cause conflicts and lost performance, an instance of speculative_spin_mutex occupies two cache lines.
An example of using a speculative lock:
#include <tbb/spin_mutex.h>
#include <set>

tbb::speculative_spin_mutex tsx_mtx;
std::set<int> g_Set;

void thread_safe_add_to_set( int value ) {
    tbb::speculative_spin_mutex::scoped_lock lock(tsx_mtx);
    g_Set.insert(value);
}

The speculative_spin_rw_mutex class, as its name suggests, implements an RW spin lock with speculative execution. One might note that any speculative lock is, by definition, non-exclusive: it allows not only concurrent reads but also concurrent writes, as long as there are no conflicts. So “speculative RW lock” may sound like a tautology. Recall, however, that a thread can acquire the lock “for real”. In that case speculative_spin_mutex must give the thread exclusive access, no matter whether it reads or modifies the data; it is therefore not a true RW lock. speculative_spin_rw_mutex, on the other hand, allows multiple readers to proceed; moreover, “real” and speculative readers can execute simultaneously. This comes at a cost: the internal data fields of the class must be kept on separate cache lines to avoid simultaneous access to the same cache line from several cores (false sharing). Because of this, each instance of speculative_spin_rw_mutex occupies three cache lines.
Although speculative_spin_rw_mutex currently resides in the tbb/spin_rw_mutex.h header file, and even uses spin_rw_mutex as part of its implementation, the two classes are not fully compatible: speculative_spin_rw_mutex has no lock() or unlock() methods. It forces you to use the scoped-lock pattern, i.e., it must be accessed through speculative_spin_rw_mutex::scoped_lock. Since this is a preview feature, the TBB_PREVIEW_SPECULATIVE_SPIN_RW_MUTEX macro must be set to a non-zero value before including the header file.
Unfortunately, the usefulness and benefit of speculative locks depend heavily on the workload. Do not assume that these new classes are simply “better locks”. Careful performance studies are needed to decide, case by case, whether speculative locks are the right tool.

User-Managed Task Arenas


Another significant piece of functionality recently added to the library is user-managed task arenas. In our terminology, an arena is a place where threads share tasks and take tasks for execution. Initially, the library supported a single global arena for the whole application. We later changed this to support a separate arena for each application thread, based on user feedback that work launched by different threads should be isolated from each other. We then received requests to decouple concurrency control and work isolation from application threads, and to satisfy them we introduced user-managed task arenas. At the moment this is still a preview feature, and using it requires setting the TBB_PREVIEW_TASK_ARENA macro to a non-zero value, but we are working to make it fully supported later this year (2014).
The API for user-managed task arenas is provided by the task_arena class. When constructing a task_arena, the user can specify the desired concurrency level and how much of that concurrency should be reserved for application threads.
#define TBB_PREVIEW_TASK_ARENA 1
#include <tbb/task_arena.h>

tbb::task_arena my_arena(4, 1);

In this example, an arena is created for 4 threads, with one slot reserved for an application thread. This means that up to 3 worker threads managed by the Intel TBB library can join this arena and work on the tasks in it. There is no limit on how many application threads can submit work to the arena, but the overall concurrency is limited to 4; all “extra” threads will be unable to join the arena and execute tasks there.
To submit work to the arena, call the execute() or enqueue() method:
my_arena.enqueue( a_job_functor );
my_arena.execute( a_job_functor2 );

The work for either method can be expressed as a C++11 lambda expression or as a functor. The two methods differ in how the work is submitted. task_arena::enqueue() is an asynchronous, “fire-and-forget” call: the thread that calls it does not join the arena and returns immediately. task_arena::execute(), on the other hand, does not return until the submitted work completes; if possible, the calling thread joins the arena and executes tasks in it, otherwise it blocks until the work is done.
You can submit many individual tasks to a task_arena, but that is not what it was designed for. Typically, a single task is submitted that itself creates enough parallelism, for example by calling parallel_for:
my_arena.execute( [&]{
    tbb::parallel_for(0, N, iteration_functor());
} );

or by running a flow graph:
tbb::flow::graph g;
... // create the graph here
my_arena.enqueue( [&]{
    ... // start graph computations
} );
... // do something else
my_arena.execute( [&]{
    g.wait_for_all();
} ); // does not return until the flow graph finishes

More information about task arenas can be found in the Intel TBB Reference Manual.

Conclusion


The Intel Threading Building Blocks C++ template library offers a rich set of components for efficient, high-level, task-based parallelism and for building portable applications that will be able to harness the full power of multi-core architectures in the future. The library lets application developers focus on the parallelism in their algorithms without having to deal with the low-level details of managing that parallelism. In addition to highly efficient implementations of the most widely used high-level parallel algorithms and thread-safe containers, the library provides low-level building blocks such as a thread-safe scalable memory manager, locks, and atomic operations.
Although the Intel TBB library is already quite mature and recognized by the community, we continue to improve its performance and expand its functionality. In Intel TBB 4.0 we released flow graph support to let developers more easily implement algorithms based on data or execution dependency graphs. In Intel TBB 4.2 we exposed the benefits of Intel Transactional Synchronization Extensions through new synchronization classes, and we responded to user requests for additional control over the level of concurrency and task isolation by introducing user-managed task arenas.
You can find the latest versions of Intel TBB and further information on our sites:

Bibliography
[1] Michael McCool, Arch Robison, James Reinders “Structured Parallel Programming” parallelbook.com
[2] Vladimir Polin, “Android * Tutorial: Writing a Multithreaded Application using Intel Threading Building Blocks”. software.intel.com/en-us/android/articles/android-tutorial-writing-a-multithreaded-application-using-intel-threading-building-blocks
[3] Vladimir Polin, “Windows * 8 Tutorial: Writing a Multithreaded Application for the Windows Store * using Intel Threading Building Blocks”. software.intel.com/en-us/blogs/2013/01/14/windows-8-tutorial-writing-a-multithreaded-application-for-the-windows-store-using
[4] Michael J. Voss, “The Intel Threading Building Blocks Flow Graph”, Dr. Dobb's, October 2011. www.drdobbs.com/tools/the-intel-threading-building-blocks-flow/231900177
[5] Aparna Chandramowlishwaran, Kathleen Knobe, and Richard Vuduc, “Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems”, 2010 Symposium on Parallel & Distributed Processing (IPDPS), April 2010.
[6] Christopher Huson, “Transactional memory support: the speculative_spin_mutex”. software.intel.com/en-us/blogs/2013/10/07/transactional-memory-support-the-speculative-spin-mutex
[7] Christopher Huson, “Transactional Memory Support: the speculative_spin_rw_mutex”. software.intel.com/en-us/blogs/2014/03/07/transactional-memory-support-the-speculative-spin-rw-mutex-community-preview

Source: https://habr.com/ru/post/228469/
